Thinking with Drafting: Optical Decompression via Logical Reconstruction
Abstract
Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system in which visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
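For intuition, here is a minimal sketch of the draft-then-verify idea, assuming a toy DSL for column-aligned algebra derivations. The `Row`, `render`, and `check_brackets` names are invented for illustration; the paper's actual DSL and renderer are not specified in this abstract.

```python
# A hypothetical sketch of TwD's draft-and-verify idea. The DSL below
# (Row / render / check_brackets) is invented for illustration; the paper's
# actual grammar and renderer are not specified in this abstract.
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    """One row of a column-aligned algebra derivation, e.g. '2x + 3 = 7'."""
    lhs: str
    rhs: str

def render(rows: list[Row]) -> str:
    """Deterministically render the draft as a column-aligned text 'proof'.

    Every '=' lands in the same column, so a misaligned or missing term
    shows up as a visible structural error rather than a silent hallucination.
    """
    width = max(len(r.lhs) for r in rows)
    return "\n".join(f"{r.lhs.rjust(width)} = {r.rhs}" for r in rows)

def check_brackets(rows: list[Row]) -> bool:
    """Toy deterministic verifier: brackets must balance on every side."""
    def balanced(s: str) -> bool:
        depth = 0
        for ch in s:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                return False
        return depth == 0
    return all(balanced(r.lhs) and balanced(r.rhs) for r in rows)

# Hypothetical model output: a drafted derivation of 2x + 3 = 7  ->  x = 2.
draft = [Row("2x + 3", "7"), Row("2x", "4"), Row("x", "2")]
print(render(draft))
assert check_brackets(draft), "draft fails the deterministic structure check"
```

The sketch only illustrates the division of labour the abstract describes: the model commits to structure, and a deterministic renderer/checker produces an artifact that can be inspected or compared against the parsed input.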
Community
The core idea of Thinking with Drafting (TwD) is super refreshing: instead of letting a multimodal model “guess the answer” with fluent CoT or pretty-looking diagrams, it forces the model to draft its reasoning into executable structure. Not vibes. Not plausible pixels. But strict, renderable DSL code.
The “optical decompression” framing is also 🔥 — OCR gives you symbols, but not logical topology. TwD says: real understanding = reconstructing the hidden structure behind those symbols. And the moment the model has to commit to aligned segments, brackets, and cross-row constraints, hallucination becomes much harder.
What I like most is the shift from:
generate explanation → hope it’s right
to
generate structure → verify it deterministically
That feels like a big step toward trustworthy multimodal reasoning.
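Rough sketch of what that gate could look like, purely illustrative Python with made-up `propose_draft` / `verify` hooks (the real TwD loop presumably renders the DSL draft and checks it against the parsed figure):

```python
# Illustrative only: `propose_draft` and `verify` are made-up hooks. The point
# is that a draft is accepted by a deterministic check, not by fluent prose.
from typing import Callable, Optional

def twd_loop(propose_draft: Callable[[int], str],
             verify: Callable[[str], bool],
             max_attempts: int = 3) -> Optional[str]:
    """Return the first draft that passes deterministic verification, else None."""
    for attempt in range(max_attempts):
        draft = propose_draft(attempt)   # e.g. the model re-drafts its DSL code
        if verify(draft):                # e.g. render the draft and check structure
            return draft
    return None

# Toy stand-ins: the second draft fixes an unbalanced bracket.
drafts = ["x = (3", "x = 2"]
result = twd_loop(lambda i: drafts[min(i, len(drafts) - 1)],
                  lambda d: d.count("(") == d.count(")"))
print(result)  # -> "x = 2"
```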
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration (2026)
- Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images (2026)
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning (2026)
- CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving (2026)
- UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models (2026)
