MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
Abstract
A comprehensive memory-focused benchmark for mobile GUI agents reveals significant memory capability gaps and provides systematic evaluation methods and design insights.
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluate cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy derived from analyzing 11 agents across 5 architectures; (2) 128 tasks spanning 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated evaluation pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) an RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
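The abstract reports results as pass@k but does not spell out the estimator here. A common choice, and a reasonable assumption for how such numbers are computed, is the unbiased pass@k estimator of Chen et al. (2021): given n rollouts of a task with c successes, it gives the probability that at least one of k sampled rollouts succeeds. A minimal sketch under that assumption:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total rollouts of the task, c: successful rollouts,
    k: sample size. Returns P(at least one of k sampled
    rollouts succeeds). Assumed here; the paper's exact
    protocol may differ.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: an agent solves a task on 2 of 5 rollouts.
print(pass_at_k(n=5, c=2, k=1))  # 0.40
print(pass_at_k(n=5, c=2, k=3))  # 0.90
```

This estimator is preferred over naively averaging k-sized batches because it uses all n rollouts, reducing variance for the same evaluation budget.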
Community
The following similar papers were recommended by the Semantic Scholar API (via the Librarian Bot):
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces (2026)
- ContextBench: A Benchmark for Context Retrieval in Coding Agents (2026)
- TEMPO: A Realistic Multi-Domain Benchmark for Temporal Reasoning-Intensive Retrieval (2026)
- EvolMem: A Cognitive-Driven Benchmark for Multi-Session Dialogue Memory (2026)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development (2026)
- Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents (2026)
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts (2026)