arxiv:2507.13401

MADI: Masking-Augmented Diffusion with Inference-Time Scaling for Visual Editing

Published on Jul 16, 2025

Authors:

Abstract

MADI enhances diffusion models' controllable generation and editing capabilities through masking-augmented training and inference-time capacity scaling with pause tokens.

AI-generated summary

Despite the remarkable success of diffusion models in text-to-image generation, their effectiveness in grounded visual editing and compositional control remains challenging. Motivated by advances in self-supervised learning and in-context generative modeling, we propose a series of simple yet powerful design choices that significantly enhance diffusion model capacity for structured, controllable generation and editing. We introduce Masking-Augmented Diffusion with Inference-Time Scaling (MADI), a framework that improves the editability, compositionality and controllability of diffusion models through two core innovations. First, we introduce Masking-Augmented gaussian Diffusion (MAgD), a novel training strategy with dual corruption process which combines standard denoising score matching and masked reconstruction by masking noisy input from forward process. MAgD encourages the model to learn discriminative and compositional visual representations, thus enabling localized and structure-aware editing. Second, we introduce an inference-time capacity scaling mechanism based on Pause Tokens, which act as special placeholders inserted into the prompt for increasing computational capacity at inference time. Our findings show that adopting expressive and dense prompts during training further enhances performance, particularly for MAgD. Together, these contributions in MADI substantially enhance the editability of diffusion models, paving the way toward their integration into more general-purpose, in-context generative diffusion architectures.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.13401 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.13401 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.13401 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.