SimpleGPT: Improving GPT via a Simple Normalization Strategy
Abstract
The SimpleNorm normalization strategy stabilizes activation scales and enables larger stable learning rates in Transformer models by reducing the spectral norm of the Hessian, leading to improved training performance.
In this work, we revisit Transformer optimization through the lens of second-order geometry and establish a direct connection between architectural design, activation scale, the Hessian matrix, and the maximum tolerable learning rate. We introduce a simple normalization strategy, termed SimpleNorm, which stabilizes intermediate activation scales by construction. Then, by analyzing the Hessian of the loss with respect to network activations, we theoretically show that SimpleNorm significantly reduces the spectral norm of the Hessian, thereby permitting larger stable learning rates. We validate our theoretical findings through extensive experiments on large GPT models at parameter scales of 1B, 1.4B, 7B, and 8B. Empirically, SimpleGPT, our SimpleNorm-based network, tolerates learning rates 3×-10× larger than standard practice, consistently demonstrates strong optimization stability, and achieves substantially better performance than well-established baselines. Specifically, when training 7B-scale models for 60K steps, SimpleGPT achieves a training loss that is 0.08 lower than that of LLaMA2 with QKNorm, reducing the loss from 2.290 to 2.208. Our source code will be released at https://github.com/Ocram7/SimpleGPT.
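The abstract does not spell out SimpleNorm's exact definition, so the sketch below is only a rough illustration of the two ingredients it describes: a normalization that pins the scale of intermediate activations by construction, and the standard quadratic-model bound that ties the Hessian's spectral norm to the largest stable learning rate. The names `ScaleStabilizingNorm` and `max_stable_lr` are hypothetical and are not taken from the paper or its released code.

```python
# Hypothetical sketch, not the paper's SimpleNorm: an RMS-style normalization
# that keeps activation scales bounded, plus the classical stability bound
# relating Hessian sharpness to the largest tolerable learning rate.

import torch
import torch.nn as nn


class ScaleStabilizingNorm(nn.Module):
    """Rescales each token's hidden vector to (roughly) unit RMS, so the scale
    of downstream activations stays bounded regardless of network depth."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).sqrt()
        return self.weight * x / (rms + self.eps)


def max_stable_lr(hessian_spectral_norm: float) -> float:
    """For gradient descent on a local quadratic model, iterates remain stable
    only if the step size stays below 2 / lambda_max(H); shrinking the Hessian's
    spectral norm therefore raises the maximum tolerable learning rate."""
    return 2.0 / hessian_spectral_norm
```

Under this (assumed) quadratic-model view, halving the Hessian's spectral norm doubles the admissible step size, which is consistent with the abstract's claim that reducing Hessian sharpness permits larger stable learning rates.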
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Controlled LLM Training on Spectral Sphere (2026)
- Post-LayerNorm Is Back: Stable, Expressive, and Deep (2026)
- MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration (2026)
- YuriiFormer: A Suite of Nesterov-Accelerated Transformers (2026)
- FOAM: Blocked State Folding for Memory-Efficient LLM Training (2025)
- GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization (2026)
- FISMO: Fisher-Structured Momentum-Orthogonalized Optimizer (2026)