aiMathQuestionClassification / TRAINING.md

NeerajCodz

Fresh start: Push all project files including models and notebooks

1d5f27f 3 months ago


	# Math Question Classifier - Quick Start Guide

	## Execution Order

	### Setup (Blocks 1-7)
	Run once to set up the environment and define the classes

	1. Block 1: Install packages
	2. Block 2: Import libraries
	3. Block 3: Set data path
	4. Block 4: Convert JSON to Parquet (one-time data preparation)
	5. Block 5: Define MathDatasetLoader class
	6. Block 6: Define MathFeatureExtractor class
	7. Block 7: Define MathQuestionClassifier class

	### Training & Evaluation (Blocks 8-13)
	Run to train and evaluate models

	8. Block 8: Load dataset from Parquet files
	9. Block 9: Extract features (text preprocessing + math symbols + numeric)
	10. Block 10: Vectorize features (TF-IDF + scaling)
	11. Block 11: Train 5 models and compare performance
	12. Block 12: Detailed evaluation of best model
	13. Block 13: Complete test set analysis with 6 visualizations

	---

	## What Each Block Does

	### Blocks 1-3: Environment Setup
	- Installs scikit-learn, pandas, matplotlib, seaborn, nltk
	- Imports all necessary libraries
	- Sets path to data directory (`./math`)

	### Block 4: Data Consolidation
	Purpose: Convert JSON files to Parquet format
	- Input: `./math/train/` and `./math/test/` folders with JSON files
	- Output: `train.parquet` and `test.parquet`
	- Benefit: Loads roughly 10-100x faster than re-parsing the individual JSON files
	- Run: Only once (skip if Parquet files already exist)
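
The consolidation step can be sketched as follows. This is a minimal illustration, not the notebook's actual Block 4: it assumes one JSON object per file and that files sit in per-topic subfolders. The function names `collect_records` and `write_parquet` and the `topic` column are hypothetical, and writing Parquet through pandas needs `pyarrow` or `fastparquet` installed.

```python
import json
from pathlib import Path


def collect_records(split_dir):
    """Read every *.json file under split_dir into a list of dicts.

    Assumption: each file holds one JSON object, and the name of the
    parent folder is used as the topic label.
    """
    records = []
    for path in sorted(Path(split_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        record["topic"] = path.parent.name  # hypothetical label column
        records.append(record)
    return records


def write_parquet(records, out_path):
    """Write the records to one Parquet file (needs pyarrow or fastparquet)."""
    import pandas as pd  # imported lazily so collect_records works without pandas

    pd.DataFrame(records).to_parquet(out_path, index=False)


if __name__ == "__main__":
    for split in ("train", "test"):
        split_dir = Path("./math") / split
        if split_dir.is_dir():  # skip silently when the data folder is absent
            write_parquet(collect_records(split_dir), f"{split}.parquet")
```

Keeping the JSON reading separate from the Parquet writing makes it easy to inspect the records before committing them to disk.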

	### Blocks 5-7: Class Definitions
	Define three main classes:
	- MathDatasetLoader: Loads Parquet files, shows statistics
	- MathFeatureExtractor: Cleans LaTeX, extracts math symbols, preprocesses text
	- MathQuestionClassifier: Trains models, evaluates performance

	### Block 8: Load Data
	- Loads `train.parquet` and `test.parquet`
	- Shows class distribution for train and test sets
	- Displays 2 bar charts (train/test distribution)

	### Block 9: Feature Extraction
	Extracts three types of features:
	1. Text features: Preprocessed text (LaTeX cleaning, lemmatization)
	2. Math symbol features: 10 binary indicators (has_fraction, has_sqrt, etc.)
	3. Numeric features: 5 statistical measures (num_count, avg_number, etc.)
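
As a rough sketch of the second and third feature types, the snippet below computes two symbol indicators and two numeric measures with regular expressions. Only the feature names `has_fraction`, `has_sqrt`, `num_count`, and `avg_number` come from this guide; the patterns themselves are assumptions and the real `MathFeatureExtractor` may define them differently.

```python
import re

# Hypothetical patterns: the guide names the features, not how they are detected.
SYMBOL_PATTERNS = {
    "has_fraction": re.compile(r"\\frac|\d\s*/\s*\d"),
    "has_sqrt": re.compile(r"\\sqrt"),
}

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def symbol_features(text):
    """Return one 0/1 indicator per pattern."""
    return {name: int(bool(p.search(text))) for name, p in SYMBOL_PATTERNS.items()}


def numeric_features(text):
    """Count the numbers in the text and average them (0.0 if there are none)."""
    numbers = [float(n) for n in NUMBER_RE.findall(text)]
    avg = sum(numbers) / len(numbers) if numbers else 0.0
    return {"num_count": len(numbers), "avg_number": avg}


question = r"Simplify $\frac{3}{4} + \sqrt{16}$."
print(symbol_features(question))
print(numeric_features(question))
```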

	### Block 10: Vectorization
	- Creates TF-IDF features (5000 dimensions, unigrams through trigrams)
	- Scales additional features to [0,1] using MinMaxScaler
	- Critical: Fits ONLY on training data (prevents data leakage)
	- Converts to CSR format for efficient operations
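
The steps above can be sketched with scikit-learn and SciPy as follows. The toy texts and the three-column `train_extra` array are stand-ins for the Block 9 output, and the variable names are illustrative rather than taken from the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for the preprocessed text and the extra features from Block 9.
train_text = ["solve the equation for x", "find the area of the circle", "how many ways to choose"]
test_text = ["find x in the equation"]
train_extra = np.array([[1, 0, 2.0], [0, 1, 5.0], [0, 0, 3.0]])
test_extra = np.array([[1, 0, 4.0]])

vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
scaler = MinMaxScaler()

# Fit on the training split only, then reuse the fitted objects on the test split.
X_train_text = vectorizer.fit_transform(train_text)
X_test_text = vectorizer.transform(test_text)
X_train_extra = scaler.fit_transform(train_extra)
X_test_extra = scaler.transform(test_extra)

# Stack text and extra features side by side; CSR supports row slicing.
X_train = hstack([X_train_text, csr_matrix(X_train_extra)], format="csr")
X_test = hstack([X_test_text, csr_matrix(X_test_extra)], format="csr")

print(X_train.shape, X_test.shape, X_train.format)
```

Because `transform` reuses the training vocabulary, the test matrix always has the same number of columns as the training matrix.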

	### Block 11: Model Training
	Trains 5 optimized models:
	1. Naive Bayes (baseline)
	2. Logistic Regression (linear classifier)
	3. SVM (maximum margin)
	4. Random Forest (ensemble)
	5. Gradient Boosting (sequential ensemble)

	Output:
	- Comparison table with Accuracy, F1-Score, Training Time
	- 2 bar charts comparing performance and speed
	- Selects best model automatically
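
A compact sketch of such a comparison loop is shown below. It runs on toy data, uses `LinearSVC` and `MultinomialNB` as assumed implementations of the SVM and Naive Bayes entries, and reports a weighted F1-score; the notebook's own estimators, averaging mode, and remaining hyperparameters may differ.

```python
import time

from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy data standing in for the vectorized features from Block 10.
train_text = ["solve for x", "solve the equation", "area of a circle", "area of a triangle"] * 5
train_y = ["algebra", "algebra", "geometry", "geometry"] * 5
test_text = ["solve this equation", "area of a square"]
test_y = ["algebra", "geometry"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_text)
X_test = vec.transform(test_text)

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(C=1.0, class_weight="balanced", max_iter=1000),
    "SVM": LinearSVC(C=1.0, class_weight="balanced"),
    "Random Forest": RandomForestClassifier(max_depth=30, class_weight="balanced", random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(subsample=0.8, random_state=0),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, train_y)
    elapsed = time.perf_counter() - start
    pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(test_y, pred),
        "f1": f1_score(test_y, pred, average="weighted"),
        "train_time_s": elapsed,
    }

# Pick the model with the highest F1-score.
best_name = max(results, key=lambda n: results[n]["f1"])
print(best_name, results[best_name])
```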

	### Block 12: Detailed Evaluation
	- Confusion matrix visualization
	- Classification report (precision, recall, F1 per class)
	- Feature importance (for tree-based models)

	### Block 13: Complete Analysis
	Comprehensive evaluation on entire test set

	6 Visualizations:
	1. Confusion Matrix (absolute counts)
	2. Normalized Confusion Matrix (proportions)
	3. F1-Score by Topic (horizontal bar chart)
	4. Precision vs Recall (scatter plot, size = support)
	5. Test Set Distribution (bar chart)
	6. Confidence Distribution (histogram: correct vs incorrect)

	Analysis Sections:
	- Overall performance (accuracy, F1-score)
	- Per-class metrics table
	- Confusion pair analysis
	- Summary statistics
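
To make the per-class metrics and the confusion pair analysis concrete, here is a small standard-library sketch. The helper names `per_class_metrics` and `top_confusions` are hypothetical; the notebook most likely relies on scikit-learn's reporting functions instead.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Precision, recall, F1 and support for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        predicted = sum(c for (t, p), c in pairs.items() if p == label)
        actual = sum(c for (t, p), c in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}
    return metrics


def top_confusions(y_true, y_pred, n=3):
    """Most frequent (true, predicted) pairs among the misclassified examples."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


y_true = ["algebra", "algebra", "algebra", "prealgebra", "prealgebra", "geometry"]
y_pred = ["algebra", "prealgebra", "prealgebra", "prealgebra", "algebra", "geometry"]
print(per_class_metrics(y_true, y_pred)["algebra"])
print(top_confusions(y_true, y_pred))
```

In this toy example the most frequent error is algebra predicted as prealgebra, the same kind of neighboring-topic confusion the guide expects on the real data.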

	---

	## Expected Results

	### Model Performance (F1-Score)
	- Gradient Boosting: 86-90%
	- Logistic Regression: 85-89%
	- SVM: 84-88%
	- Naive Bayes: 78-82%
	- Random Forest: 75-82% (expected to underperform on sparse features)

	### Training Time
	- Naive Bayes: ~10 seconds
	- Logistic Regression: ~30 seconds
	- SVM: ~2 minutes
	- Random Forest: ~3 minutes
	- Gradient Boosting: ~5 minutes

	### Per-Topic Performance
	High Performance (F1 > 90%):
	- counting_and_probability
	- number_theory

	Medium Performance (F1: 85-90%):
	- geometry
	- precalculus

	Challenging (F1: 80-85%):
	- algebra ↔ intermediate_algebra (similar concepts)
	- prealgebra ↔ algebra (overlapping operations)

	---

	## Key Design Decisions

	### 1. Data Leakage Prevention
	Critical: TF-IDF vectorizer fitted ONLY on training data
	```
	Train/Test Split → Fit Vectorizer on Train → Transform Both
	```
	Without this, the test set's vocabulary and document-frequency statistics leak into training, which can inflate reported performance by an estimated 1-3%.
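
One way to enforce this ordering is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that `fit` only ever sees training rows. This is an alternative technique shown for illustration, not necessarily how the notebook is organized, and the toy texts are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = ["solve for x", "solve the equation", "factor the polynomial", "find the root",
         "area of a circle", "area of a triangle", "angle of the polygon", "radius of the sphere"]
labels = ["algebra"] * 4 + ["geometry"] * 4

# Split first, so the test rows are never seen while fitting.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# The pipeline fits the vectorizer inside fit(), i.e. on the training rows only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Words that occur only in the test split stay out of the learned vocabulary.
# Single-character words are skipped because the default tokenizer ignores them.
train_vocab = set(pipeline.named_steps["tfidfvectorizer"].vocabulary_)
test_only_words = {w for t in X_test for w in t.split() if len(w) > 1} - train_vocab
print("Words seen only in the test split:", sorted(test_only_words))
print("Test accuracy:", pipeline.score(X_test, y_test))
```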

	### 2. Feature Engineering
	Hybrid approach:
	- TF-IDF (5000 features): Captures text content
	- Math symbols (10 features): Topic indicators (e.g., integrals → calculus)
	- Numeric features (5 features): Statistical properties

	Why no hand-crafted keywords?
	Topic-specific keyword lists were avoided to prevent heuristic bias; the model learns its discriminative vocabulary from the data instead.

	### 3. Hyperparameter Optimization
	All models use optimized parameters:
	- C=1.0 (SVM/Logistic): Balanced regularization
	- max_depth=30 (Random Forest): Sufficient complexity
	- subsample=0.8 (Gradient Boosting): Stochastic sampling prevents overfitting

	### 4. Class Imbalance Handling
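	`class_weight='balanced'` automatically adjusts weights inversely proportional to class frequencies. In scikit-learn this option is available for Logistic Regression, SVM, and Random Forest; Naive Bayes and `GradientBoostingClassifier` do not accept it.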
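
The weighting rule behind this option is `n_samples / (n_classes * class_count)`. The short sketch below reproduces it with the standard library; `balanced_class_weights` is a hypothetical helper written for illustration, and the label counts are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = ["algebra"] * 6 + ["geometry"] * 3 + ["number_theory"] * 1
print(balanced_class_weights(labels))
```

The rarest class receives the largest weight, so its errors count more heavily in the loss.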

	---

	## Methodology

	### Problem Type
	Supervised Multi-Class Text Classification

	Why Classification (not Clustering)?
	- Categories are predefined and every training example is labeled
	- Objective: Assign each question to a known subtopic
	- Clustering would discover latent groups, which is not the goal here

	### Pipeline
	```
	JSON Files
	↓
	Parquet Conversion (Block 4)
	↓
	Feature Extraction (Block 9)
	↓
	TF-IDF Vectorization (Block 10)
	↓
	Model Training (Block 11)
	↓
	Evaluation (Blocks 12-13)
	```

	### Feature Vector
	```
	Total: 5015 dimensions
	├── TF-IDF: 5000 (unigrams, bigrams, trigrams)
	├── Math Symbols: 10 (binary indicators)
	└── Numeric: 5 (scaled to [0,1])
	```

	---

	## Troubleshooting

	### "No data loaded"
	Solution: Check data path in Block 3
	```python
	DATA_PATH = './math' # Adjust to your path
	```

	### "NameError: name 'results' is not defined"
	Solution: Run the blocks in order. Blocks 12-13 use the `results` produced by Block 11.

	### "ValueError: Negative values"
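	Solution: This error is typically raised by Naive Bayes, which rejects negative feature values. Make sure Block 10 ran to completion: its MinMaxScaler scales the additional features to [0,1] before training.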
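
One possible cause worth checking, offered as an assumption rather than a diagnosis of the notebook: a scaler fitted on the training split can still map test values outside [0,1] when they fall outside the training range. The sketch below shows the effect and the `clip=True` option of `MinMaxScaler`, using invented numbers.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[2.0], [4.0], [6.0]])
test = np.array([[1.0], [8.0]])  # outside the range seen in training

# Default behaviour: test values outside the training range leave [0, 1].
plain = MinMaxScaler().fit(train)
print(plain.transform(test).ravel())

# clip=True clamps transformed values back into the feature range.
clipped = MinMaxScaler(clip=True).fit(train)
print(clipped.transform(test).ravel())
```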

	### "TypeError: coo_matrix not subscriptable"
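	Solution: Stacking sparse and dense features (for example with `scipy.sparse.hstack`) can yield a COO matrix, which does not support indexing or slicing. Block 10 converts the result to CSR format, so make sure it ran to completion before the feature matrix is indexed.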

	### Model underperforms
	Check:
	1. Data leakage prevented? (Vectorizer fitted on train only)
	2. Features extracted correctly? (Block 9 output)
	3. Class distribution balanced? (Block 8 charts)

	---

	## Performance Optimization

	### Speed Up Training
	```python
	from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

	# Reduce vocabulary
	vectorizer_config = {'max_features': 2000}

	# Fewer trees
	rf = RandomForestClassifier(n_estimators=100)

	# Fewer boosting rounds
	gb = GradientBoostingClassifier(n_estimators=50)
	```

	### Reduce Memory
	```python
	# Smaller vocabulary and fewer n-grams. Set both keys in one dict:
	# assigning vectorizer_config twice would discard the first setting.
	vectorizer_config = {
	    'max_features': 3000,
	    'ngram_range': (1, 2),
	}
	```

	---

	## Output Files

	After Block 13 completes, you'll have:
	- train.parquet: Training data (consolidated)
	- test.parquet: Test data (consolidated)
	- Performance metrics and visualizations
	- Trained model held in memory only (`classifier.best_model`); it is not written to disk

	---

	## Next Steps

	### Save Model
	Add after Block 13:
	```python
	import pickle

	# Bundle everything needed to reproduce predictions later
	model_data = {
	    'model': classifier.best_model,
	    'vectorizer': classifier.vectorizer,
	    'scaler': classifier.scaler,
	    'label_encoder': classifier.label_encoder
	}
	with open('model.pkl', 'wb') as f:
	    pickle.dump(model_data, f)
	```

	### Batch Prediction
	```python
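	import pickle

	# Load the saved components
	with open('model.pkl', 'rb') as f:
	    model_data = pickle.load(f)
	model = model_data['model']
	vectorizer = model_data['vectorizer']
	scaler = model_data['scaler']
	label_encoder = model_data['label_encoder']

	# Predict
	new_problems = ["Solve x^2 = 16", "Find area of circle"]
	for problem in new_problems:
	    # Preprocess → Extract features → Predict (same steps as Blocks 9-10)
	    prediction = model.predict(...)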
	```

	---

	## Summary

	13 Blocks, 3 Stages:
	1. Setup (Blocks 1-7): One-time environment setup
	2. Training (Blocks 8-11): Data loading and model training
	3. Evaluation (Blocks 12-13): Comprehensive analysis

	Key Features:
	- Data leakage prevention
	- 5 optimized models
	- 6 visualization types
	- Probability predictions
	- Error analysis

	Expected Time: 10-15 minutes total (including training)

	Expected Performance: 85-90% F1-score on test set