BadScientist: Can a Research Agent Write Convincing but Unsound Papers that Fool LLM Reviewers? Paper • 2510.18003 • Published Oct 20, 2025
Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? Paper • 2605.12684 • Published 9 days ago • 11
VisualSphinx: Large-Scale Synthetic Vision Logic Puzzles for RL Paper • 2505.23977 • Published May 29, 2025 • 10