
# SP3F-7B

SP3F-7B is a multilingual model trained with Self-Play with Privileged Pairwise Feedback (SP3F), using Qwen2.5-7B as the base model.
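The card does not include a usage snippet, so here is a minimal inference sketch with Hugging Face `transformers`, assuming the model keeps the chat template inherited from its Qwen2.5-7B base. The example question and generation settings are illustrative, not taken from the paper.

```python
# Minimal inference sketch for SP3F-7B (chat-style usage, assuming the
# standard Qwen2 chat template inherited from the Qwen2.5-7B base).


def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the message format used by apply_chat_template."""
    return [{"role": "user", "content": question}]


def main() -> None:
    # Heavy imports kept inside main() so build_messages stays importable
    # without torch/transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "neulab/SP3F-7B"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Example multilingual math question (MGSM-style, in French; illustrative).
    messages = build_messages(
        "Janet a 3 pommes et en achète 5 de plus. Combien de pommes a-t-elle ?"
    )
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens.
    print(
        tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
    )


if __name__ == "__main__":
    main()
```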

| Model | Overall Acc | Overall Lang | MGSM MT Acc | MGSM MT Lang | Math100 Acc | Math100 Lang | Belebele Acc | Belebele Lang | Global MMLU Lite Acc | Global MMLU Lite Lang |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B | 14.79 | 78.78 | 22.15 | 90.67 | 21.16 | 58.22 | 7.52 | 80.39 | 8.34 | 85.85 |
| + SFT | 21.70 | 82.11 | 33.66 | 91.37 | 26.72 | 58.26 | 12.94 | 89.18 | 13.48 | 89.62 |
| + SFT + RLVR | 57.79 | 96.09 | 65.34 | 99.75 | 44.50 | 86.10 | 68.18 | 98.73 | 53.15 | 99.78 |
| SP3F-7B | 61.91 | 95.35 | 72.50 | 99.38 | 56.84 | 82.93 | 67.54 | 99.65 | 50.76 | 99.45 |
| Qwen2.5-7B-Instruct | 55.87 | 89.21 | 66.36 | 98.38 | 52.12 | 65.66 | 56.79 | 96.59 | 48.20 | 96.21 |
| + Translate Test | 57.01 | 85.98 | 66.15 | 95.81 | 60.08 | 59.34 | 48.09 | 92.27 | 53.73 | 96.49 |

## Citation

If you find this work helpful, please use the following to cite it:

```bibtex
@misc{sutawika2026gainedtranslationprivilegedpairwise,
      title={Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning},
      author={Lintang Sutawika and Gokul Swamy and Zhiwei Steven Wu and Graham Neubig},
      year={2026},
      eprint={2601.18722},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.18722},
}
```