ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing
Mirrored and converted from FunAudioLLM/PrismAudio.
All weights have been converted from PyTorch `.ckpt`/`.pth` checkpoints to SafeTensors format:
| File | Description |
|---|---|
| `prismaudio.safetensors` | Main PrismAudio model weights (518M params) |
| `synchformer_state_dict.safetensors` | Synchformer temporal alignment encoder |
| `vae.safetensors` | Oobleck VAE decoder |
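To sanity-check a downloaded file against the table above (for example, the stated parameter count), the SafeTensors header can be read without loading any weights. The sketch below uses only the Python standard library and relies on the published SafeTensors layout: an 8-byte little-endian header length followed by a JSON header. The helper names `read_safetensors_header` and `count_parameters` are illustrative and are not part of this repository or of the `safetensors` library.

```python
import json
import struct
from math import prod


def read_safetensors_header(path):
    """Return the tensor entries of a .safetensors header as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string-to-string map, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(header):
    """Sum the element counts of all tensors described in the header."""
    # prod([]) == 1, so scalar tensors count as one element.
    return sum(prod(info["shape"]) for info in header.values())


if __name__ == "__main__":
    import sys

    # Usage (paths are whatever you saved the files as locally):
    #   python inspect_weights.py prismaudio.safetensors vae.safetensors
    for path in sys.argv[1:]:
        header = read_safetensors_header(path)
        print(f"{path}: {len(header)} tensors, "
              f"{count_parameters(header):,} parameters")
```

Because only the header is parsed, this runs quickly even on large checkpoints and needs neither PyTorch nor the `safetensors` package.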
These weights are used by the MAESTRO AI Workstation's PrismAudio panel for decomposed Chain-of-Thought video-to-audio generation.
@misc{liu2025thinksound,
title={ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing},
author={Huadai Liu and Jialei Wang and Kaicheng Luo and Wen Wang and Qian Chen and Zhou Zhao and Wei Xue},
year={2025},
eprint={2506.21448},
archivePrefix={arXiv},
}