arXiv:2502.16435

Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Published on Feb 23, 2025

Abstract

The VisFactor benchmark reveals that current MLLMs perform poorly on fundamental visual cognition tasks, indicating a gap between existing models and human-like visual understanding.

AI-generated summary

Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests of the Factor-Referenced Cognitive Test (FRCT), a well-established cognitive psychology assessment, spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 23 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 30.17%. Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks may be castles in the air rather than evidence of genuine, human-like visual cognition.
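
The paper's automatic test-case construction is not detailed on this page, but the idea of generating and validating items with controllable difficulty can be illustrated with a minimal sketch. The snippet below is a hypothetical example in the spirit of a mental-rotation subtest: it grows a random polyomino, produces a probe that is either a pure rotation ("same") or a mirrored rotation ("different"), rejects ambiguous shapes whose mirror image coincides with a rotation of the original, and controls difficulty through the number of cells. The function names and parameters (`random_polyomino`, `make_rotation_item`, `n_cells`) are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def random_polyomino(n_cells: int, grid: int = 8, rng=None) -> np.ndarray:
    """Grow a connected blob of n_cells True pixels on a grid x grid canvas."""
    rng = rng or np.random.default_rng()
    canvas = np.zeros((grid, grid), dtype=bool)
    r = c = grid // 2
    canvas[r, c] = True
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while canvas.sum() < n_cells:
        dr, dc = moves[rng.integers(4)]
        r = int(np.clip(r + dr, 0, grid - 1))
        c = int(np.clip(c + dc, 0, grid - 1))
        canvas[r, c] = True
    return canvas

def make_rotation_item(n_cells: int = 6, rng=None):
    """Return (reference, probe, label); the probe is either a pure rotation
    of the reference ("same") or a mirrored rotation ("different")."""
    rng = rng or np.random.default_rng()
    ref = random_polyomino(n_cells, rng=rng)
    k = int(rng.integers(1, 4))            # rotate by 90, 180, or 270 degrees
    rotated = np.rot90(ref, k)
    mirrored = np.fliplr(rotated)
    # Validation step: reject ambiguous shapes whose mirror image equals
    # some rotation of the original (the item would have no correct answer).
    if any(np.array_equal(mirrored, np.rot90(ref, j)) for j in range(4)):
        return make_rotation_item(n_cells, rng=rng)
    if rng.random() < 0.5:
        return ref, rotated, "same"
    return ref, mirrored, "different"

# Difficulty is controllable: more cells -> more complex shape -> harder item.
reference, probe, label = make_rotation_item(n_cells=8)
```

Rendering the two boolean grids as images and asking a model whether the probe is a rotated copy of the reference yields one binary-choice item; sweeping `n_cells` and the rotation angle is one simple way to scale difficulty, in the spirit of the unlimited, validated test-case generation described in the abstract.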
