Thinking with Drafting: Optical Decompression via Logical Reconstruction
Abstract
Visual reasoning is enhanced by reconstructing logical structures from compressed visual tokens through a DSL-based approach that generates deterministic visual proofs for verification.
Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts that lack mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression: the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which uses a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system in which visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
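For intuition, here is a minimal sketch of the draft-then-verify idea, assuming a toy DSL for column-aligned algebra derivations. The `Row`, `render`, and `check_brackets` names are invented for illustration; the paper's actual DSL and renderer are not specified in this abstract.

```python
# A hypothetical sketch of TwD's draft-and-verify idea. The DSL below
# (Row / render / check_brackets) is invented for illustration; the paper's
# actual grammar and renderer are not specified in this abstract.
from dataclasses import dataclass

@dataclass(frozen=True)
class Row:
    """One row of a column-aligned algebra derivation, e.g. '2x + 3 = 7'."""
    lhs: str
    rhs: str

def render(rows: list[Row]) -> str:
    """Deterministically render the draft as a column-aligned text 'proof'.

    Every '=' lands in the same column, so a misaligned or missing term
    shows up as a visible structural error rather than a silent hallucination.
    """
    width = max(len(r.lhs) for r in rows)
    return "\n".join(f"{r.lhs.rjust(width)} = {r.rhs}" for r in rows)

def check_brackets(rows: list[Row]) -> bool:
    """Toy deterministic verifier: brackets must balance on every side."""
    def balanced(s: str) -> bool:
        depth = 0
        for ch in s:
            depth += (ch == "(") - (ch == ")")
            if depth < 0:
                return False
        return depth == 0
    return all(balanced(r.lhs) and balanced(r.rhs) for r in rows)

# Hypothetical model output: a drafted derivation of 2x + 3 = 7  ->  x = 2.
draft = [Row("2x + 3", "7"), Row("2x", "4"), Row("x", "2")]
print(render(draft))
assert check_brackets(draft), "draft fails the deterministic structure check"
```

The sketch only illustrates the division of labour the abstract describes: the model commits to structure, and a deterministic renderer/checker produces an artifact that can be inspected or compared against the parsed input.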
Community
The core idea of Thinking with Drafting (TwD) is super refreshing: instead of letting a multimodal model “guess the answer” with fluent CoT or pretty-looking diagrams, it forces the model to draft its reasoning into executable structure. Not vibes. Not plausible pixels. But strict, renderable DSL code.
The “optical decompression” framing is also 🔥 — OCR gives you symbols, but not logical topology. TwD says: real understanding = reconstructing the hidden structure behind those symbols. And the moment the model has to commit to aligned segments, brackets, and cross-row constraints, hallucination becomes much harder.
What I like most is the shift from:
generate explanation → hope it’s right
to
generate structure → verify it deterministically
That feels like a big step toward trustworthy multimodal reasoning.
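Rough sketch of what that gate could look like, purely illustrative Python with made-up `propose_draft` / `verify` hooks (the real TwD loop presumably renders the DSL draft and checks it against the parsed figure):

```python
# Illustrative only: `propose_draft` and `verify` are made-up hooks. The point
# is that a draft is accepted by a deterministic check, not by fluent prose.
from typing import Callable, Optional

def twd_loop(propose_draft: Callable[[int], str],
             verify: Callable[[str], bool],
             max_attempts: int = 3) -> Optional[str]:
    """Return the first draft that passes deterministic verification, else None."""
    for attempt in range(max_attempts):
        draft = propose_draft(attempt)   # e.g. the model re-drafts its DSL code
        if verify(draft):                # e.g. render the draft and check structure
            return draft
    return None

# Toy stand-ins: the second draft fixes an unbalanced bracket.
drafts = ["x = (3", "x = 2"]
result = twd_loop(lambda i: drafts[min(i, len(drafts) - 1)],
                  lambda d: d.count("(") == d.count(")"))
print(result)  # -> "x = 2"
```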
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration (2026)
- Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images (2026)
- Unified Thinker: A General Reasoning Modular Core for Image Generation (2026)
- LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning (2026)
- TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning (2026)
- CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving (2026)
- UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models (2026)
