Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training
Abstract
DeMix is a framework that uses model merging to predict optimal data ratios for LLM pre-training, decoupling search from training costs to improve mixture discovery efficiency.
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search costs from training costs, enabling evaluation of arbitrarily many sampled mixtures without additional training and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and the DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
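To make the merging-as-proxy idea concrete, the Python sketch below shows how candidate mixtures could be scored by weighted-averaging the parameters of component models trained once per candidate dataset. This is a minimal illustration based only on the abstract, not the authors' released implementation; the helper names (`build_model`, `evaluate_on_benchmark`) and the use of plain state-dict averaging are assumptions for exposition.

```python
# Minimal sketch (assumed, not the paper's released code): score data mixtures
# by evaluating a parameter-space weighted average of component checkpoints,
# so no extra training is needed per candidate mixture.

def merge_by_ratios(state_dicts, ratios):
    """Weighted average of component model parameters, one component per candidate dataset."""
    assert abs(sum(ratios) - 1.0) < 1e-6, "mixture ratios should sum to 1"
    return {
        name: sum(r * sd[name] for r, sd in zip(ratios, state_dicts))
        for name in state_dicts[0]
    }

def search_mixtures(state_dicts, candidate_ratios, build_model, evaluate_on_benchmark):
    """Score each candidate mixture with a merged proxy model (hypothetical helpers)."""
    best_ratios, best_score = None, float("-inf")
    for ratios in candidate_ratios:
        proxy = build_model()                                  # fresh model skeleton
        proxy.load_state_dict(merge_by_ratios(state_dicts, ratios))
        score = evaluate_on_benchmark(proxy)                   # e.g., held-out accuracy
        if score > best_score:
            best_ratios, best_score = ratios, score
    return best_ratios, best_score
```

Because each candidate mixture costs only a weighted parameter average plus one evaluation pass, the number of mixtures searched is decoupled from the number of training runs, which is the efficiency argument the abstract makes.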
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MergeMix: Optimizing Mid-Training Data Mixtures via Learnable Model Merging (2026)
- TREX: Tokenizer Regression for Optimal Data Mixture (2026)
- AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages (2026)
- MiniLingua: A Small Open-Source LLM for European Languages (2025)
- Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice (2025)
- Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation (2025)
- ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection (2026)