| # Math Question Classifier - Quick Start Guide |
|
|
| ## Execution Order |
|
|
| ### Setup (Blocks 1-7) |
| **Run once to set up the environment and define the classes** |
|
|
| 1. **Block 1**: Install packages |
| 2. **Block 2**: Import libraries |
| 3. **Block 3**: Set data path |
| 4. **Block 4**: Convert JSON to Parquet (one-time data preparation) |
| 5. **Block 5**: Define MathDatasetLoader class |
| 6. **Block 6**: Define MathFeatureExtractor class |
| 7. **Block 7**: Define MathQuestionClassifier class |
|
|
| ### Training & Evaluation (Blocks 8-13) |
| **Run to train and evaluate models** |
|
|
| 8. **Block 8**: Load dataset from Parquet files |
| 9. **Block 9**: Extract features (text preprocessing + math symbols + numeric) |
| 10. **Block 10**: Vectorize features (TF-IDF + scaling) |
| 11. **Block 11**: Train 5 models and compare performance |
| 12. **Block 12**: Detailed evaluation of best model |
| 13. **Block 13**: Complete test set analysis with 6 visualizations |
|
|
| --- |
|
|
| ## What Each Block Does |
|
|
| ### Blocks 1-3: Environment Setup |
| - Installs scikit-learn, pandas, matplotlib, seaborn, nltk |
| - Imports all necessary libraries |
| - Sets path to data directory (`./math`) |
|
|
| ### Block 4: Data Consolidation |
| **Purpose**: Convert JSON files to Parquet format |
| - **Input**: `./math/train/` and `./math/test/` folders with JSON files |
| - **Output**: `train.parquet` and `test.parquet` |
| - **Benefit**: 10-100x faster loading than JSON |
| - **Run**: Only once (skip if Parquet files already exist) |
|
|
| ### Blocks 5-7: Class Definitions |
| Define three main classes: |
| - **MathDatasetLoader**: Loads Parquet files, shows statistics |
| - **MathFeatureExtractor**: Cleans LaTeX, extracts math symbols, preprocesses text |
| - **MathQuestionClassifier**: Trains models, evaluates performance |
|
|
| ### Block 8: Load Data |
| - Loads `train.parquet` and `test.parquet` |
| - Shows class distribution for train and test sets |
| - Displays 2 bar charts (train/test distribution) |
|
|
| ### Block 9: Feature Extraction |
| Extracts three types of features: |
| 1. **Text features**: Preprocessed text (LaTeX cleaning, lemmatization) |
| 2. **Math symbol features**: 10 binary indicators (has_fraction, has_sqrt, etc.) |
| 3. **Numeric features**: 5 statistical measures (num_count, avg_number, etc.) |
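To make the symbol and numeric feature groups concrete, here is a small sketch. Only `has_fraction`, `has_sqrt`, `num_count`, and `avg_number` are names taken from this guide; the other feature names and the exact matching rules are assumptions for illustration, and the real `MathFeatureExtractor` may work differently.

```python
import re

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def math_symbol_features(text):
    """Binary indicators for LaTeX constructs (illustrative subset)."""
    return {
        "has_fraction": int("\\frac" in text),
        "has_sqrt": int("\\sqrt" in text),
        "has_exponent": int("^" in text),   # hypothetical feature name
        "has_equation": int("=" in text),   # hypothetical feature name
    }


def numeric_features(text):
    """Simple statistics over the numbers that appear in the text."""
    numbers = [float(n) for n in NUMBER_RE.findall(text)]
    return {
        "num_count": len(numbers),
        "avg_number": sum(numbers) / len(numbers) if numbers else 0.0,
        "max_number": max(numbers) if numbers else 0.0,  # hypothetical feature name
    }


problem = r"Simplify $\frac{6}{8} + \sqrt{16}$."
features = {**math_symbol_features(problem), **numeric_features(problem)}
print(features)
```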
|
|
| ### Block 10: Vectorization |
| - Creates TF-IDF features (up to 5000 dimensions; unigrams, bigrams, and trigrams) |
| - Scales additional features to [0,1] using MinMaxScaler |
| - **Critical**: Fits ONLY on training data (prevents data leakage) |
| - Converts to CSR format for efficient operations |
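The fit-on-train-only pattern described above might look like the sketch below. The toy texts and the two extra columns are placeholders for the real preprocessed text and the 15 symbol/numeric features; the variable names are assumptions rather than the notebook's own.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

train_texts = [
    "solve for x in the equation",
    "find the area of the triangle",
    "how many ways to choose two cards",
]
test_texts = ["find the area of the circle"]
# Stand-ins for the symbol/numeric columns (only 2 columns here).
train_extra = [[1, 3.0], [0, 12.0], [0, 52.0]]
test_extra = [[0, 7.0]]

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on the training split only, then reuse the fitted objects on test.
X_train_text = vectorizer.fit_transform(train_texts)
X_test_text = vectorizer.transform(test_texts)
X_train_extra = scaler.fit_transform(train_extra)
X_test_extra = scaler.transform(test_extra)

# Stack text and extra features, then force CSR so rows can be indexed.
X_train = hstack([X_train_text, csr_matrix(X_train_extra)]).tocsr()
X_test = hstack([X_test_text, csr_matrix(X_test_extra)]).tocsr()

print(X_train.shape, X_test.shape)
```

Because the scaler is fitted on training data, a test value outside the training range maps outside [0,1]. A test value below the training minimum becomes negative, which is worth keeping in mind for models that require non-negative input.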
|
|
| ### Block 11: Model Training |
| Trains 5 optimized models: |
| 1. **Naive Bayes** (baseline) |
| 2. **Logistic Regression** (linear classifier) |
| 3. **SVM** (maximum margin) |
| 4. **Random Forest** (ensemble) |
| 5. **Gradient Boosting** (sequential ensemble) |
|
|
| **Output**: |
| - Comparison table with Accuracy, F1-Score, Training Time |
| - 2 bar charts comparing performance and speed |
| - Selects best model automatically |
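A train-and-compare loop along these lines could be sketched as follows. It uses synthetic data and only three of the five model families to stay short. `LinearSVC`, the weighted F1 average, and the layout of the `results` dictionary are assumptions, not confirmed details of Block 11.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the real feature matrix.
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Naive Bayes needs non-negative input, so scale using training statistics.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test).clip(0, 1)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
    "SVM": LinearSVC(C=1.0),  # assumption: a linear SVM variant
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, average="weighted"),
        "train_seconds": elapsed,
    }

# Pick the model with the highest F1-score.
best_name = max(results, key=lambda n: results[n]["f1"])
print(best_name, results[best_name])
```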
|
|
| ### Block 12: Detailed Evaluation |
| - Confusion matrix visualization |
| - Classification report (precision, recall, F1 per class) |
| - Feature importance (for tree-based models) |
|
|
| ### Block 13: Complete Analysis |
| **Comprehensive evaluation on entire test set** |
|
|
| **6 Visualizations**: |
| 1. Confusion Matrix (absolute counts) |
| 2. Normalized Confusion Matrix (proportions) |
| 3. F1-Score by Topic (horizontal bar chart) |
| 4. Precision vs Recall (scatter plot, size = support) |
| 5. Test Set Distribution (bar chart) |
| 6. Confidence Distribution (histogram: correct vs incorrect) |
|
|
| **Analysis Sections**: |
| - Overall performance (accuracy, F1-score) |
| - Per-class metrics table |
| - Confusion pair analysis |
| - Summary statistics |
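The per-class metrics and confusion pair analysis can be illustrated without any library, as in this sketch. The function names and the toy labels are hypothetical, and Block 13 may well rely on scikit-learn's reporting utilities instead.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred, top=3):
    """Most frequent (true, predicted) pairs among misclassified items."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)


def per_class_metrics(y_true, y_pred):
    """Precision, recall and F1 for each label, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


y_true = ["algebra", "algebra", "geometry", "prealgebra", "algebra"]
y_pred = ["algebra", "intermediate_algebra", "geometry", "algebra",
          "intermediate_algebra"]

print(confusion_pairs(y_true, y_pred))
print(per_class_metrics(y_true, y_pred)["algebra"])
```

In this toy input, the most common confusion is algebra predicted as intermediate_algebra, mirroring the kind of overlap listed under "Challenging" topics.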
|
|
| --- |
|
|
| ## Expected Results |
|
|
| ### Model Performance (F1-Score) |
| - **Gradient Boosting**: 86-90% |
| - **Logistic Regression**: 85-89% |
| - **SVM**: 84-88% |
| - **Naive Bayes**: 78-82% |
| - **Random Forest**: 75-82% (expected to underperform on sparse features) |
|
|
| ### Training Time |
| - **Naive Bayes**: ~10 seconds |
| - **Logistic Regression**: ~30 seconds |
| - **SVM**: ~2 minutes |
| - **Random Forest**: ~3 minutes |
| - **Gradient Boosting**: ~5 minutes |
|
|
| ### Per-Topic Performance |
| **High Performance** (F1 > 90%): |
| - counting_and_probability |
| - number_theory |
| |
| **Medium Performance** (F1: 85-90%): |
| - geometry |
| - precalculus |
| |
| **Challenging** (F1: 80-85%): |
| - algebra ↔ intermediate_algebra (similar concepts) |
| - prealgebra ↔ algebra (overlapping operations) |
|
|
| --- |
|
|
| ## Key Design Decisions |
|
|
| ### 1. Data Leakage Prevention |
| **Critical**: TF-IDF vectorizer fitted ONLY on training data |
| ``` |
| Train/Test Split → Fit Vectorizer on Train → Transform Both |
| ``` |
| Without this, test vocabulary leaks into training, inflating performance by 1-3%. |
|
|
| ### 2. Feature Engineering |
| **Hybrid approach**: |
| - TF-IDF (5000 features): Captures text content |
| - Math symbols (10 features): Topic indicators (e.g., integrals → calculus) |
| - Numeric features (5 features): Statistical properties |
|
|
| **Why no hand-crafted keywords?** |
| Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns its discriminative vocabulary from the data. |
|
|
| ### 3. Hyperparameter Optimization |
| All models use optimized parameters: |
| - **C=1.0** (SVM/Logistic): Balanced regularization |
| - **max_depth=30** (Random Forest): Sufficient complexity |
| - **subsample=0.8** (Gradient Boosting): Stochastic sampling prevents overfitting |
| |
| ### 4. Class Imbalance Handling |
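| For estimators that accept it (e.g., Logistic Regression, SVM, Random Forest), `class_weight='balanced'` automatically sets class weights inversely proportional to class frequencies. Naive Bayes and Gradient Boosting in scikit-learn do not take this parameter. |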
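As a worked illustration of the "balanced" heuristic, the weight for each class is `n_samples / (n_classes * class_count)`. The sketch below computes this by hand on made-up labels; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy label distribution: 6 / 3 / 1 examples per class.
labels = ["algebra"] * 6 + ["geometry"] * 3 + ["number_theory"] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The rarest class receives the largest weight, so its errors count more during training.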
| |
| --- |
| |
| ## Methodology |
| |
| ### Problem Type |
| **Supervised Multi-Class Text Classification** |
|
|
| **Why Classification (not Clustering)?** |
| - Categories are predefined and labeled |
| - Objective: Assign each question to a known subtopic |
| - Not discovering latent groups |
| - Supervised learning with known labels |
|
|
| ### Pipeline |
| ``` |
| JSON Files |
| ↓ |
| Parquet Conversion (Block 4) |
| ↓ |
| Feature Extraction (Block 9) |
| ↓ |
| TF-IDF Vectorization (Block 10) |
| ↓ |
| Model Training (Block 11) |
| ↓ |
| Evaluation (Blocks 12-13) |
| ``` |
|
|
| ### Feature Vector |
| ``` |
| Total: 5015 dimensions |
| ├── TF-IDF: 5000 (unigrams, bigrams, trigrams) |
| ├── Math Symbols: 10 (binary indicators) |
| └── Numeric: 5 (scaled to [0,1]) |
| ``` |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### "No data loaded" |
| **Solution**: Check data path in Block 3 |
| ```python |
| DATA_PATH = './math' # Adjust to your path |
| ``` |
|
|
| ### "NameError: name 'results' is not defined" |
| **Solution**: Run the blocks in order. Blocks 12-13 depend on the `results` produced by Block 11. |
|
|
| ### "ValueError: Negative values" |
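| **Solution**: This error typically comes from Naive Bayes, which requires non-negative input. Block 10 scales the extra features to [0,1] with MinMaxScaler, so re-run Block 10 to completion before training. |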
|
|
| ### "TypeError: coo_matrix not subscriptable" |
| **Solution**: Block 10 converts the stacked feature matrix to CSR format, which supports indexing. Ensure it runs to completion before the later blocks. |
| |
| ### Model underperforms |
| **Check**: |
| 1. Data leakage prevented? (Vectorizer fitted on train only) |
| 2. Features extracted correctly? (Block 9 output) |
| 3. Class distribution balanced? (Block 8 charts) |
| |
| --- |
| |
| ## Performance Optimization |
| |
| ### Speed Up Training |
| ```python |
| from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier |
| |
| # Reduce vocabulary |
| vectorizer_config = {'max_features': 2000} |
| |
| # Fewer trees |
| rf = RandomForestClassifier(n_estimators=100) |
| |
| # Fewer boosting rounds |
| gb = GradientBoostingClassifier(n_estimators=50) |
| ``` |
| |
| ### Reduce Memory |
| ```python |
| # Smaller vocabulary and fewer n-grams (unigrams and bigrams only), |
| # set in one dict so the second setting does not overwrite the first |
| vectorizer_config = { |
|     'max_features': 3000, |
|     'ngram_range': (1, 2), |
| } |
| ``` |
| |
| --- |
| |
| ## Output Files |
| |
| After Block 13 completes, you'll have: |
| - **train.parquet**: Training data (consolidated) |
| - **test.parquet**: Test data (consolidated) |
| - Performance metrics and visualizations |
| - Trained model held in memory only (`classifier.best_model`); it is not written to disk unless you save it as described under "Save Model" |
|
|
| --- |
|
|
| ## Next Steps |
|
|
| ### Save Model |
| Add after Block 13: |
| ```python |
| import pickle |
| |
| # Bundle the model with the fitted preprocessing objects it depends on |
| model_data = { |
|     'model': classifier.best_model, |
|     'vectorizer': classifier.vectorizer, |
|     'scaler': classifier.scaler, |
|     'label_encoder': classifier.label_encoder |
| } |
| with open('model.pkl', 'wb') as f: |
|     pickle.dump(model_data, f) |
| ``` |
|
|
| ### Batch Prediction |
| ```python |
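| import pickle |
| |
| # Load model and its fitted preprocessing objects |
| with open('model.pkl', 'rb') as f: |
|     model_data = pickle.load(f) |
| model = model_data['model'] |
| |
| # Predict |
| new_problems = ["Solve x^2 = 16", "Find area of circle"] |
| for problem in new_problems: |
|     # Preprocess → Extract features → Vectorize with the saved vectorizer and scaler → Predict |
|     prediction = model.predict(...)  # placeholder: pass the feature row built for `problem` |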
| ``` |
|
|
| --- |
|
|
| ## Summary |
|
|
| **13 Blocks, 3 Stages**: |
| 1. **Setup** (Blocks 1-7): One-time environment setup |
| 2. **Training** (Blocks 8-11): Data loading and model training |
| 3. **Evaluation** (Blocks 12-13): Comprehensive analysis |
|
|
| **Key Features**: |
| - Data leakage prevention |
| - 5 optimized models |
| - 6 visualization types |
| - Probability predictions |
| - Error analysis |
|
|
| **Expected Time**: 10-15 minutes total (including training) |
|
|
| **Expected Performance**: 85-90% F1-score on test set |
|
|