MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use Paper • 2509.24002 • Published Sep 28, 2025 • 176
Community Evals: Because we're done trusting black-box leaderboards over the community Article • 11 days ago • 66
NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition Paper • 2507.18130 • Published Jul 24, 2025 • 1
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch Paper • 2601.13606 • Published 26 days ago • 11
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods Paper • 2601.21821 • Published 16 days ago • 59
BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution Paper • 2510.08697 • Published Oct 9, 2025 • 39
Kimi-K2 Collection Moonshot's MoE LLMs with 1 trillion parameters, exceptional on agentic intelligence • 5 items • Updated 19 days ago • 172
Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning Paper • 2601.20209 • Published 18 days ago • 22
AgentDoG Collection A Diagnostic Guardrail Framework for AI Agent Safety and Security • 11 items • Updated 18 days ago • 101
EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience Paper • 2601.15876 • Published 24 days ago • 90