Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts (Paper 2601.22156)
Unlocking On-Policy Distillation for Any Model Family: improve model performance by transferring knowledge between different model families