Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers Paper • 2604.17632 • Published 5 days ago • 10
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation Paper • 2604.18240 • Published 4 days ago • 14
Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language Paper • 2604.19667 • Published 3 days ago • 16
TEMPO: Scaling Test-time Training for Large Reasoning Models Paper • 2604.19295 • Published 3 days ago • 28
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents Paper • 2604.17308 • Published 5 days ago • 21
Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG Paper • 2604.14572 • Published 8 days ago • 7
ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack Paper • 2509.25843 • Published 10 days ago • 19
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures Paper • 2604.07709 • Published 10 days ago • 1
GlotOCR Bench: OCR Models Still Struggle Beyond a Handful of Unicode Scripts Paper • 2604.12978 • Published 10 days ago • 5
Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness Paper • 2604.12373 • Published 10 days ago • 9
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation Paper • 2604.09497 • Published 14 days ago • 29
TradingAgents: Multi-Agents LLM Financial Trading Framework Paper • 2412.20138 • Published Dec 28, 2024 • 48
Kronos: A Foundation Model for the Language of Financial Markets Paper • 2508.02739 • Published Aug 2, 2025 • 24
The Depth Ceiling: On the Limits of Large Language Models in Discovering Latent Planning Paper • 2604.06427 • Published 17 days ago • 11
Article How I contributed a new model to the Transformers library using Codex 24 days ago • 48
Reasoning Shift: How Context Silently Shortens LLM Reasoning Paper • 2604.01161 • Published 22 days ago • 32