Token-Based Audio Inpainting via Discrete Diffusion (AIDD)
Pretrained model weights for AIDD, introduced in:
Token-Based Audio Inpainting via Discrete Diffusion
ICLR 2026
https://arxiv.org/abs/2507.08333
AIDD performs audio inpainting by applying diffusion in a discrete token space, enabling semantically coherent reconstruction of missing audio segments, including long gaps of up to 750 ms.
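As a rough illustration of the token-space setup (not code from this repository): a temporal gap in the waveform corresponds to a contiguous span of token positions, which is masked out and then denoised by the model. The token rate and mask id below are assumptions chosen for the sketch; actual values depend on the WavTokenizer configuration.

```python
import torch

# Illustrative assumptions: WavTokenizer variants run at roughly 40 or 75
# tokens per second, and the mask id depends on the model's vocabulary.
TOKEN_RATE_HZ = 75
MASK_ID = 4096

def mask_gap(tokens: torch.Tensor, gap_start_s: float, gap_len_s: float) -> torch.Tensor:
    """Replace the token span covering a temporal gap with the mask id."""
    start = int(round(gap_start_s * TOKEN_RATE_HZ))
    end = start + int(round(gap_len_s * TOKEN_RATE_HZ))
    masked = tokens.clone()
    masked[start:end] = MASK_ID
    return masked

# A 750 ms gap starting at 2.0 s maps to ~56 masked tokens at 75 Hz.
tokens = torch.randint(0, 4096, (750,))
masked = mask_gap(tokens, gap_start_s=2.0, gap_len_s=0.75)
```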
Model
The model operates on discrete audio tokens produced by a pretrained WavTokenizer and performs inpainting with a Diffusion Transformer (DiT) trained under a discrete diffusion objective. Training combines span-based masking, which models structured missing regions, with a derivative-based regularization loss that encourages smooth temporal dynamics in the token embedding space. The model is designed to restore missing segments in musical audio, including long gaps.
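The derivative-based regularization admits a simple sketch. The version below penalizes the first differences between adjacent token embeddings; it is one plausible reading of the idea, not the paper's exact formulation.

```python
import torch

def derivative_regularization(embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize large first differences between adjacent token embeddings.

    embeddings: (batch, time, dim) embeddings of the predicted tokens.
    A sketch of a smoothness penalty in embedding space; see the paper
    for the exact loss used in training.
    """
    diffs = embeddings[:, 1:, :] - embeddings[:, :-1, :]
    return diffs.pow(2).sum(dim=-1).mean()
```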
Usage
This repository provides model weights only.
For code, see the official GitHub repository:
👉 https://github.com/iftachShoham/AIDD
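Since only weights are hosted here, a minimal sketch of fetching a checkpoint with huggingface_hub (the repo id and filename are hypothetical; check the files on this page for the actual names):

```python
import torch
from huggingface_hub import hf_hub_download

# Hypothetical repo id and filename -- substitute the actual values
# listed on this model page.
ckpt_path = hf_hub_download(repo_id="iftachShoham/AIDD", filename="aidd.pt")

# Load the raw state dict; the model class that consumes it lives in
# the GitHub repository linked above.
state_dict = torch.load(ckpt_path, map_location="cpu")
```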
Data & Evaluation
Trained and evaluated on MusicNet and MAESTRO, using Fréchet Audio Distance (FAD), Log-Spectral Distance (LSD), Objective Difference Grade (ODG), and Mean Opinion Score (MOS) as metrics.
See the paper for full details.
Acknowledgments
Built upon
Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution and
WavTokenizer: An Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling.
We thank the authors for making their work publicly available.
Citation
@article{dror2025token,
title={Token-based Audio Inpainting via Discrete Diffusion},
author={Dror, Tali and Shoham, Iftach and Buchris, Moshe and Gal, Oren and Permuter, Haim and Katz, Gilad and Nachmani, Eliya},
journal={arXiv preprint arXiv:2507.08333},
year={2025}
}