---
datasets:
- LEMAS-Project/LEMAS-Dataset-train
- LEMAS-Project/LEMAS-Dataset-eval
language:
- it
- pt
- es
- fr
- de
- vi
- id
- ru
- en
- zh
license: cc-by-nc-4.0
pipeline_tag: text-to-speech
tags:
- zero-shot
- multilingual
---

# LEMAS-TTS

LEMAS-TTS is a multilingual zero-shot text-to-speech system, presented in the paper [LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models](https://huggingface.co/papers/2601.04233).

- **Project Page:** [https://lemas-project.github.io/LEMAS-Project](https://lemas-project.github.io/LEMAS-Project)
- **Paper:** [https://arxiv.org/abs/2601.04233](https://arxiv.org/abs/2601.04233)
- **GitHub Repository:** [https://github.com/LEMAS-Project/LEMAS-TTS](https://github.com/LEMAS-Project/LEMAS-TTS)
- **Hugging Face Demo:** [https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS](https://huggingface.co/spaces/LEMAS-Project/LEMAS-TTS)

## Model Description

LEMAS-TTS is built upon a non-autoregressive flow-matching framework. It leverages the massive scale and linguistic diversity of the LEMAS-Dataset to achieve robust zero-shot multilingual synthesis. The model incorporates accent-adversarial training and CTC loss to mitigate cross-lingual accent issues, enhancing synthesis stability and quality across diverse languages.
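
To make the flow-matching idea concrete, here is a minimal, dependency-free sketch of one common formulation: a straight-line path from Gaussian noise to a data sample, a squared-error loss on the path velocity, and a fixed-step Euler sampler. This card only states that the model is non-autoregressive and flow-matching based, so the path choice, the sampler, and every name below (`interpolate`, `flow_matching_loss`, `euler_sample`, `make_oracle`) are illustrative assumptions, not the LEMAS-TTS implementation. Short lists of floats stand in for acoustic features.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """One-sample estimate of the flow-matching loss for a data point x1."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # assumed Gaussian noise prior
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def euler_sample(predict_velocity, x0, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle(x1):
    """Hypothetical stand-in for a trained network when the data set is the
    single point x1: the exact velocity at (x, t) is (x1 - x) / (1 - t)."""
    def predict_velocity(x, t):
        return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]
    return predict_velocity


if __name__ == "__main__":
    data = [0.5, -1.0, 2.0]
    oracle = make_oracle(data)
    print(flow_matching_loss(oracle, data, random.Random(0)))
    print(euler_sample(oracle, [0.0, 0.0, 0.0], steps=8))
```

In a real system the oracle would be replaced by a neural network conditioned on text and a reference prompt, and all output frames would be generated in parallel rather than one at a time, which is what makes the approach non-autoregressive.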

## Supported Languages

The model supports 10 major languages for zero-shot synthesis:

- Chinese (zh)
- English (en)
- Spanish (es)
- Russian (ru)
- French (fr)
- German (de)
- Italian (it)
- Portuguese (pt)
- Indonesian (id)
- Vietnamese (vi)
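
When wiring the model into an application, it can help to validate a requested language against the list above before attempting synthesis. The snippet below is a small sketch of such a check. `SUPPORTED_LANGUAGES` and `normalize_language` are hypothetical names introduced here for illustration and are not part of the LEMAS-TTS code base; the handling of regional tags such as `pt-BR` is likewise an assumption.

```python
# Codes and names copied from the list of supported languages in this card.
SUPPORTED_LANGUAGES = {
    "zh": "Chinese",
    "en": "English",
    "es": "Spanish",
    "ru": "Russian",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
    "id": "Indonesian",
    "vi": "Vietnamese",
}


def normalize_language(tag):
    """Reduce a tag such as 'pt-BR', 'zh_CN' or 'EN' to its two-letter code.

    Raises ValueError when the base language is not in SUPPORTED_LANGUAGES.
    """
    code = tag.strip().lower().replace("_", "-").split("-")[0]
    if code not in SUPPORTED_LANGUAGES:
        supported = ", ".join(sorted(SUPPORTED_LANGUAGES))
        raise ValueError(f"Unsupported language {tag!r}; expected one of: {supported}")
    return code


if __name__ == "__main__":
    for tag in ("pt-BR", "zh_CN", "EN"):
        code = normalize_language(tag)
        print(tag, "->", code, SUPPORTED_LANGUAGES[code])
```

Failing early with a clear message avoids passing an unsupported language code to the synthesis step.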

## Training Data

LEMAS-TTS was trained on the [LEMAS-Dataset](https://huggingface.co/datasets/LEMAS-Project/LEMAS-Dataset-train), which is, to our knowledge, currently the largest open-source multilingual speech corpus with word-level timestamps. It covers over 150,000 hours across 10 major languages.

## Citation

```bibtex
@article{zhao2026lemas,
  title={LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models},
  author={Zhao, Zhiyuan and Lin, Lijian and Zhu, Ye and Xie, Kai and Liu, Yunfei and Li, Yu},
  journal={arXiv preprint arXiv:2601.04233},
  year={2026}
}
```