# SmolLM2-70M
A SmolLM2-70M model pretrained on the Sutra-10B pedagogical dataset for 3 epochs (~30.6B tokens total). It demonstrates that a 69M-parameter model can be trained to near its capacity ceiling using dense, curated educational data.
## Model Details
| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | 69.2M |
| Hidden Size | 384 |
| Layers | 32 |
| Attention Heads | 6 (2 KV heads) |
| Context Length | 8,192 |
| Vocabulary | 49,152 |
| Precision | bfloat16 |
| Base Model | SmolLM2-70M |
| Training Dataset | Sutra-10B (10.2B tokens) |
## Training
The model was trained for 3 epochs on the Sutra-10B dataset using a single NVIDIA L40S GPU (46GB). This checkpoint is the best perplexity checkpoint from epoch 3.
| Epoch | Tokens | Training Time | Learning Rate | Best Perplexity |
|---|---|---|---|---|
| 1 | 10.2B | 25.82h | 3e-4 → 3e-5 | 39.50 |
| 2 | 10.2B | 25.78h | 1e-4 → 1e-5 | 37.81 |
| 3 | 10.2B | 26.16h | 3e-5 → 3e-6 | 37.72 |
| Total | 30.6B | 77.76h | — | 37.72 |
Training configuration:
- Optimizer: AdamW (fused), weight decay 0.1
- Schedule: Cosine with warmup
- Batch size: 4 per device, gradient accumulation 8 (effective ~262K tokens/step)
- Sequence length: 8,192
- Flash Attention 2, TF32 matmul, torch.compile
- Throughput: ~110K tokens/sec
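As a sanity check, the effective batch size and throughput above can be reproduced from the other reported numbers (all values copied from the tables and bullets in this section):

```python
# Effective tokens per optimizer step: per-device batch x grad accumulation x sequence length
per_device_batch = 4
grad_accum = 8
seq_len = 8192
tokens_per_step = per_device_batch * grad_accum * seq_len  # single-GPU setup
print(tokens_per_step)  # 262144, i.e. the "~262K tokens/step" above

# Implied throughput from the epoch-1 row of the training table
epoch_tokens = 10.2e9
epoch_hours = 25.82
throughput = epoch_tokens / (epoch_hours * 3600)
print(f"{throughput / 1e3:.0f}K tokens/sec")  # matches the "~110K tokens/sec" figure
```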
## Benchmark Results
All benchmarks evaluated using lm-evaluation-harness v0.4.11. All tasks are 0-shot except GSM8K (5-shot).
### This Model vs Training Progression
| Benchmark | E3-best | E3-final | E2-best | E2-final | E1-final |
|---|---|---|---|---|---|
| ARC-Easy | 33.00 | 33.16 | 32.83 | 33.12 | 33.46 |
| ARC-Challenge | 22.35 | 21.67 | 22.61 | 22.44 | 22.44 |
| BoolQ | 39.66 | 39.66 | 39.79 | 39.54 | 39.79 |
| HellaSwag | 26.14 | 26.03 | 26.08 | 25.91 | 26.03 |
| PIQA | 54.84 | 55.01 | 54.24 | 54.13 | 54.62 |
| SciQ | 45.20 | 46.30 | 44.10 | 45.50 | 43.60 |
| WinoGrande | 50.04 | 49.33 | 50.51 | 48.70 | 48.78 |
| TruthfulQA | 48.02 | 47.93 | 48.30 | 48.14 | 48.30 |
| GSM8K | 0.53 | 0.61 | 0.68 | 0.83 | 0.15 |
| MMLU | 22.96 | 22.87 | 23.00 | 22.98 | 22.99 |
| OpenBookQA | 27.60 | 27.60 | — | — | — |
| Average (10) | 34.27 | 34.26 | 34.21 | 34.13 | 34.02 |
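The "Average (10)" row can be reproduced from the E3-best column; OpenBookQA is excluded because it was only evaluated at epoch 3. A quick check (scores copied from the table above):

```python
# E3-best scores for the 10 benchmarks included in the average (OpenBookQA excluded)
e3_best = {
    "ARC-Easy": 33.00, "ARC-Challenge": 22.35, "BoolQ": 39.66,
    "HellaSwag": 26.14, "PIQA": 54.84, "SciQ": 45.20,
    "WinoGrande": 50.04, "TruthfulQA": 48.02, "GSM8K": 0.53,
    "MMLU": 22.96,
}
avg = sum(e3_best.values()) / len(e3_best)
print(round(avg, 2))  # 34.27, matching the "Average (10)" row
```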
### Comparison with 1B-Token Baselines (SmolLM2-70M)
These results come from training the same SmolLM2-70M model for 1 epoch on various 1B-token datasets from the Pre-training Dataset Samples collection; Sutra-10B at 3 epochs achieves the highest 10-benchmark average for this model size.
| Dataset (1B tokens) | HellaSwag | PIQA | WinoGrande | ARC-C | MMLU | TruthfulQA | GSM8K | Avg |
|---|---|---|---|---|---|---|---|---|
| Sutra-10B (3 epochs) | 26.14 | 54.84 | 50.04 | 22.35 | 22.96 | 48.02 | 0.53 | 34.27* |
| Sutra-1B | 25.43 | 53.86 | 49.41 | 23.04 | 22.91 | 49.09 | 1.14 | 32.13 |
| FineWiki-1B | 25.56 | 51.69 | 48.86 | 24.15 | 23.34 | 51.16 | 0.91 | 32.24 |
| FinePDFs-1B | 25.58 | 52.56 | 50.51 | 22.44 | 22.95 | 51.41 | 1.21 | 32.38 |
| DCLM-Baseline-1B | 25.85 | 55.17 | 50.20 | 21.08 | 22.97 | 49.21 | 0.68 | 32.16 |
| FineWeb-Edu-1B | 25.72 | 55.11 | 50.36 | 21.25 | 22.96 | 48.11 | 1.21 | 32.10 |
| Essential-Web-1B | 26.02 | 55.44 | 48.30 | 20.99 | 22.95 | 49.59 | 1.29 | 32.08 |
| Synth-1B | 26.63 | 50.98 | 48.78 | 21.93 | 23.24 | 47.10 | 1.29 | 31.42 |

\* 10-benchmark average; over the 7 benchmarks shown in this table, the Sutra-10B average is 32.13.
## Key Findings
Capacity ceiling: The 70M parameter model reaches its capacity ceiling at approximately 10B tokens. Additional epochs (up to 30.6B total tokens) yield only marginal improvements in benchmark scores (+0.25 average from epoch 1 to epoch 3), despite continued perplexity improvement (39.50 → 37.72).
Perplexity vs benchmarks: Perplexity continues to decrease across epochs, but downstream benchmark performance plateaus, suggesting the model's representational capacity is the bottleneck rather than data exposure.
Data quality matters: Even at 1B tokens, Sutra matches or outperforms same-size samples of large web-crawled corpora (DCLM, FineWeb-Edu, Essential-Web) on average, demonstrating the value of curated pedagogical content.
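The perplexity-vs-benchmark gap noted above can be made concrete by converting the reported perplexities back to cross-entropy loss (perplexity = exp(loss)), using values from the training and benchmark tables:

```python
import math

# Best-perplexity values from the training table
ppl_epoch1, ppl_epoch3 = 39.50, 37.72
loss1, loss3 = math.log(ppl_epoch1), math.log(ppl_epoch3)
print(f"loss: {loss1:.3f} -> {loss3:.3f} ({loss1 - loss3:.3f} nats/token gained)")

# 10-benchmark averages (E1-final vs E3-best) from the progression table
avg_epoch1, avg_epoch3 = 34.02, 34.27
print(f"benchmark average: +{avg_epoch3 - avg_epoch1:.2f} points")
```

Two extra epochs buy only ~0.05 nats/token of loss and ~0.25 benchmark points, consistent with the capacity-ceiling interpretation.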
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("codelion/SmolLM2-70M", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("codelion/SmolLM2-70M")

input_text = "The theory of relativity states that"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Limitations
- This is a 69M parameter base model (not instruction-tuned) — it generates completions, not conversational responses
- Performance is at the capacity ceiling for this model size; larger models would benefit more from the Sutra-10B dataset
- The model was trained primarily on English educational content
## Related Resources
- Dataset: codelion/sutra-10B — 10B token pedagogical pretraining dataset
- Sutra Framework: Generates structured educational content optimized for LLM pretraining
## License
Apache 2.0
## Evaluation Details

All scores are self-reported, from the best epoch-3 checkpoint:

| Benchmark | Metric | Shots | Score |
|---|---|---|---|
| ARC-Easy | Normalized accuracy | 0 | 33.00 |
| ARC-Challenge | Normalized accuracy | 0 | 22.35 |
| BoolQ | Accuracy | 0 | 39.66 |
| HellaSwag | Normalized accuracy | 0 | 26.14 |
| PIQA | Normalized accuracy | 0 | 54.84 |
| SciQ | Normalized accuracy | 0 | 45.20 |
| WinoGrande | Accuracy | 0 | 50.04 |
| TruthfulQA MC2 | Accuracy | 0 | 48.02 |
| GSM8K | Exact match | 5 | 0.53 |
| MMLU | Accuracy | 0 | 22.96 |
| OpenBookQA | Normalized accuracy | 0 | 27.60 |