Communication-Inspired Tokenization for Structured Image Representations
Abstract
COMiT framework learns structured discrete visual tokens through iterative encoding and flow-matching decoding, improving object-centric representation and compositional generalization.
Discrete image tokenizers have emerged as a key component of modern vision and multimodal systems, providing a sequential interface for transformer-based architectures. However, most existing approaches remain primarily optimized for reconstruction and compression, often yielding tokens that capture local texture rather than object-level semantic structure. Inspired by the incremental and compositional nature of human communication, we introduce COMmunication inspired Tokenization (COMiT), a framework for learning structured discrete visual token sequences. COMiT constructs a latent message within a fixed token budget by iteratively observing localized image crops and recurrently updating its discrete representation. At each step, the model integrates new visual information while refining and reorganizing the existing token sequence. After several encoding iterations, the final message conditions a flow-matching decoder that reconstructs the full image. Both encoding and decoding are implemented within a single transformer model and trained end-to-end using a combination of flow-matching reconstruction and semantic representation alignment losses. Our experiments demonstrate that while semantic alignment provides grounding, attentive sequential tokenization is critical for inducing interpretable, object-centric token structure and substantially improving compositional generalization and relational reasoning over prior methods.
Community
A new way to represent images as discrete sequences of tokens by sequentially integrating information from image crops, yielding semantically meaningful structured representations.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation (2026)
- Language-Guided Transformer Tokenizer for Human Motion Generation (2026)
- Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability (2026)
- VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents (2026)
- OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence (2026)
- NativeTok: Native Visual Tokenization for Improved Image Generation (2026)
- Improving Flexible Image Tokenizers for Autoregressive Image Generation (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Models citing this paper 3
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper