arXiv:2602.13013

Towards Universal Video MLLMs with Attribute-Structured and Quality-Verified Instructions

Published on Feb 13 · Submitted by Yunheng Li on Feb 16

Abstract

AI-generated summary

A large-scale dataset and model for fine-grained audiovisual understanding are introduced, demonstrating improved caption quality and reduced hallucinations through structured annotations and supervised fine-tuning.
Universal video understanding requires modeling fine-grained visual and audio information over time in diverse real-world scenarios. However, the performance of existing models is primarily constrained by video-instruction data that represents complex audiovisual content as single, incomplete descriptions, lacking fine-grained organization and reliable annotation. To address this, we introduce: (i) ASID-1M, an open-source collection of one million structured, fine-grained audiovisual instruction annotations with single- and multi-attribute supervision; (ii) ASID-Verify, a scalable data curation pipeline whose automatic verification and refinement enforce semantic and temporal consistency between descriptions and the corresponding audiovisual content; and (iii) ASID-Captioner, a video understanding model trained via Supervised Fine-Tuning (SFT) on ASID-1M. Experiments across seven benchmarks covering audiovisual captioning, attribute-wise captioning, caption-based QA, and caption-based temporal grounding show that ASID-Captioner improves fine-grained caption quality while reducing hallucinations and strengthening instruction following. It achieves state-of-the-art performance among open-source models and is competitive with Gemini-3-Pro.

Community

Paper submitter

ASID-Caption introduces a data-and-model suite for fine-grained audiovisual video understanding.

We present:
• ASID-1M, a large-scale collection of attribute-structured and quality-verified audiovisual instruction annotations;
• ASID-Verify, a scalable multi-stage pipeline that generates, ensembles, verifies, and refines captions with semantic and temporal consistency checks (a rough sketch follows this list);
• ASID-Captioner, instruction-tuned audiovisual models trained on ASID-1M.
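
To make the pipeline stages concrete, here is a minimal Python sketch of a generate-ensemble-verify-refine loop in the spirit of ASID-Verify. All names here (generate_caption, merge_captions, check_consistency, refine_caption) are hypothetical stand-ins for the stages described above, not the paper's actual implementation:

```python
# Minimal sketch only: every method name below is a hypothetical
# stand-in for an ASID-Verify stage; the real pipeline may differ.

def curate_caption(clip, generators, verifier, max_rounds=3):
    """Generate -> ensemble -> verify -> refine one training caption."""
    # Stage 1: draw candidate captions from several generator models.
    candidates = [g.generate_caption(clip) for g in generators]

    # Stage 2: ensemble the candidates into a single draft caption.
    draft = verifier.merge_captions(candidates)

    # Stages 3-4: check semantic and temporal consistency against the
    # audiovisual content, refining until the checks pass.
    for _ in range(max_rounds):
        report = verifier.check_consistency(clip, draft)
        if report.semantic_ok and report.temporal_ok:
            return draft  # quality-verified caption
        draft = verifier.refine_caption(clip, draft, report)

    return None  # discard samples that never pass verification
```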

Our framework moves beyond “one video → one generic caption” by enabling controllable, attribute-aware supervision across scene, objects, actions, speech, camera, and narrative elements.
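
As an illustration of what attribute-structured supervision could look like, here is a minimal Python sketch of one record. The field names mirror the attribute axes listed above, but the actual ASID-1M schema is not shown on this page and may differ:

```python
# Illustrative only: fields mirror the attribute axes named above;
# the real ASID-1M record format may differ.
from dataclasses import dataclass, field

@dataclass
class AttributeAnnotation:
    video_id: str
    start_sec: float              # temporal span the record covers
    end_sec: float
    scene: str = ""               # setting / environment
    objects: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    speech: str = ""              # transcribed or summarized audio
    camera: str = ""              # camera motion and framing
    narrative: str = ""           # high-level event summary

    def caption(self, attrs=None):
        """Compose a single- or multi-attribute supervision target."""
        attrs = attrs or ["scene", "objects", "actions",
                          "speech", "camera", "narrative"]
        parts = []
        for name in attrs:
            value = getattr(self, name)
            if isinstance(value, list):
                value = ", ".join(value)
            if value:
                parts.append(f"{name}: {value}")
        return " | ".join(parts)
```

Under this sketch, a single-attribute target would be rec.caption(["camera"]) and a multi-attribute one rec.caption(["scene", "actions"]), which is the controllability the framework is aiming at.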

Project page: https://asid-caption.github.io/
Code: https://github.com/ASID-Caption/ASID-Caption
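
If ASID-1M is released on the Hugging Face Hub (the page lists one dataset citing this paper), loading it could look roughly like the snippet below. The repository id is a hypothetical placeholder; check the project page for the actual location:

```python
from datasets import load_dataset

# Hypothetical repository id; see the project page for the real one.
ds = load_dataset("ASID-Caption/ASID-1M", split="train")
print(ds[0])  # one attribute-structured instruction annotation
```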


Models citing this paper 2

Datasets citing this paper 1


Collections including this paper 1