In a Training Loop 🔄

John Smith PRO

John6666

AI & ML interests

None yet

Recent Activity

reacted to rajkumarrawal's post with 👍 about 6 hours ago
I submitted the paper "AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts" to Daily Papers on Hugging Face (https://huggingface.co/huggingface).
Authors: Keyu Li (@weizhihao1), Junhao Shi, Dequan Wang (@dqwang), Yang Xiao (@YangXiao-nlp), Mohan Jiang, Jie Sun (@Sunshine279), Yunze Wu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Pengfei Liu.
Affiliations: Shanghai Jiao Tong University (https://huggingface.co/SJTU), The Hong Kong Polytechnic University (https://huggingface.co/PolyUHK), SII-GAIR (https://huggingface.co/GAIR).
This is potentially another direction for benchmarking the frontiers of autonomous agents in 2026. Some of the observations reported:
-- Long-horizon tasks remain challenging: even frontier models struggle with sustained reasoning over real-world tasks that require 1M tokens and 90 tool calls, indicating limits in long-context autonomy.
-- Proprietary models outperform open-source models: closed-source models achieve a higher average score (48.4%) than their open-source counterparts (32.1%), revealing a persistent performance gap on complex agentic tasks.
-- Feedback-driven self-correction varies widely: models like GPT 5.2 and Claude show strong gains from iterative feedback, while others (e.g. DeepSeek V3.2) exhibit minimal or no improvement after feedback.
-- Efficiency trade-offs are significant: high-performing models often consume far more tokens and time, while some models (e.g. Grok 4.1 Fast) are more token-efficient despite lower absolute scores.
-- Agentic scaffolds strongly influence performance: models tend to perform best within their native or optimized ecosystems, highlighting that agent performance depends on tight coupling between the model and its scaffold, not on the model alone.
... and many more.
https://huggingface.co/papers/2601.11044

Organizations

Glide
open/ acc
Solving Real World Problems
FashionStash Group meeting
No More Copyright
XORTRON - Criminal Computing