Stop the Flip-Flop: Context-Preserving Verification for Fast Revocable Diffusion Decoding
Abstract
COVER enables efficient parallel decoding for diffusion language models through cache-override verification that reduces unnecessary revisions while maintaining output quality via stable drafting and a two-view attention construction.
Parallel diffusion decoding can accelerate diffusion language model inference by unmasking multiple tokens per step, but aggressive parallelism often harms quality. Revocable decoding mitigates this by rechecking earlier tokens, yet we observe that existing verification schemes frequently trigger flip-flop oscillations, where tokens are remasked and later restored unchanged. This behaviour slows inference in two ways: remasking verified positions weakens the conditioning context for parallel drafting, and repeated remask cycles consume the revision budget with little net progress. We propose COVER (Cache Override Verification for Efficient Revision), which performs leave-one-out verification and stable drafting within a single forward pass. COVER constructs two attention views via KV cache override: selected seeds are masked for verification, while their cached key-value states are injected for all other queries to preserve contextual information, with a closed-form diagonal correction preventing self-leakage at the seed positions. COVER further prioritises seeds using a stability-aware score that balances uncertainty, downstream influence, and cache drift, and it adapts the number of verified seeds per step. Across benchmarks, COVER markedly reduces unnecessary revisions and yields faster decoding while preserving output quality.
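To make the two-view construction concrete, here is a minimal single-head PyTorch sketch, assuming per-layer tensors q, k_new, v_new from the current pass (seed positions re-masked) and k_cache, v_cache cached from a previous pass (seed positions still unmasked). The function name, tensor layout, and exact form of the diagonal correction are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def cover_attention_view(q, k_new, v_new, k_cache, v_cache, seed_idx, scale):
    """Single-head attention with KV cache override (illustrative sketch).

    q, k_new, v_new : [T, d] states from the current pass, in which the
                      seed tokens have been re-masked for verification.
    k_cache, v_cache: [T, d] states cached from a previous pass, in which
                      the seed tokens were still unmasked.
    seed_idx        : LongTensor of positions being verified leave-one-out.
    """
    # Override: every query sees the cached (unmasked) K/V at the seed
    # positions, so masking the seeds does not weaken anyone else's context.
    k = k_new.clone()
    v = v_new.clone()
    k[seed_idx] = k_cache[seed_idx]
    v[seed_idx] = v_cache[seed_idx]

    logits = (q @ k.T) * scale                          # [T, T]

    # Diagonal correction: a seed's own query must not attend to its cached,
    # unmasked key, or the leave-one-out check would leak the answer.
    # Swap each seed's diagonal logit for the one against its masked-state key.
    diag_masked = (q[seed_idx] * k_new[seed_idx]).sum(-1) * scale
    logits[seed_idx, seed_idx] = diag_masked

    attn = F.softmax(logits, dim=-1)
    out = attn @ v

    # The seed rows also mixed in the cached value at their own position;
    # undo that with a rank-one (closed-form) update so they see the
    # masked-state value instead.
    w_diag = attn[seed_idx, seed_idx].unsqueeze(-1)     # [S, 1]
    out[seed_idx] = out[seed_idx] + w_diag * (v_new[seed_idx] - v_cache[seed_idx])
    return out
```

Non-seed queries thus keep the full committed context, while each seed position is genuinely predicted from everything except itself, all within one forward pass.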
Community
We found a silly failure mode in Parallel Revocable Diffusion Decoding: flip-flop. A token gets ReMask’ed… then comes back unchanged. In the existing approach, <1% of ReMasks actually change the token (≈99% wasted).
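A tiny helper (not from the paper) shows how one could measure that wasted-ReMask rate during decoding; the event format is an assumption.

```python
def flip_flop_rate(remask_events):
    """Fraction of ReMasks that were wasted, i.e. the position was later
    refilled with exactly the same token it held before the ReMask.

    remask_events: iterable of (token_before_remask, token_after_refill)
    pairs collected over a decoding run (format is an assumption).
    """
    events = list(remask_events)
    if not events:
        return 0.0
    wasted = sum(before == after for before, after in events)
    return wasted / len(events)
```

On the runs described above, this ratio would come out near 0.99.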
We propose COVER, which verifies without nuking context: mask seeds for leave-one-out, but inject their cached K,V for everyone else. A simple diagonal correction removes self-leakage. Result: fewer useless revisions + faster parallel drafting.
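The abstract also mentions a stability-aware score over uncertainty, downstream influence, and cache drift, plus an adaptive number of verified seeds per step. Below is a purely illustrative way such a selection could be wired up; the weights, threshold, and proxy inputs are assumptions, not the paper's formulation.

```python
import torch

def stability_score(token_probs, influence, cache_drift,
                    w_unc=1.0, w_inf=1.0, w_drift=1.0):
    """Illustrative stability-aware seed score; higher = verify sooner.

    token_probs : [T] model probability of each committed token
                  (uncertainty proxy is 1 - p).
    influence   : [T] proxy for how strongly a position conditions the
                  still-masked suffix (e.g. attention mass it receives).
    cache_drift : [T] how far the position's cached K/V has drifted since
                  it was committed (e.g. a cosine distance).
    """
    return (w_unc * (1.0 - token_probs)
            + w_inf * influence
            + w_drift * cache_drift)

def pick_seeds(score, max_seeds, threshold=0.5):
    """Adaptive seed count: verify positions whose score clears a threshold,
    capped at max_seeds per step (both values are assumptions)."""
    candidates = (score > threshold).nonzero(as_tuple=True)[0]
    if candidates.numel() == 0:
        return candidates
    order = score[candidates].argsort(descending=True)
    return candidates[order[:max_seeds]]
```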
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching (2026)
- FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion (2026)
- Reversible Diffusion Decoding for Diffusion Language Models (2026)
- DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs (2026)
- Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding (2026)
- Deferred Commitment Decoding for Diffusion Language Models (2026)
- DAWN: Dependency-Aware Fast Inference for Diffusion LLMs (2026)
