sarashina2.2-tts
sarashina2.2-tts is a Japanese-centric text-to-speech system built on a large language model, developed by SB Intuitions. It supports Japanese and English, delivering high pronunciation accuracy, naturalness, and stability across diverse speaking styles, with zero-shot voice generation support.
Highlights
- 🇯🇵 Japanese-Centric: Designed and optimized specifically for Japanese, with broad coverage of real-world use cases.
- 🎯 High Accuracy: Delivers strong pronunciation accuracy on Japanese text through large-scale end-to-end training.
- 🔒 Responsibly Sourced Training Data: Trained exclusively on legitimately acquired and properly licensed speech data.
- 🎙️ Zero-shot Voice Generation: Reproduces a speaker's voice, speaking style, and acoustic characteristics from a short reference clip.
- 🔊 Natural & Expressive: Produces highly natural speech with consistent quality, supporting a wide range of speaking styles including narration, broadcast, conversation, and customer service.
- 🌐 Bilingual: Supports both Japanese and English text-to-speech synthesis.
Training Data
This model was trained on audio data collected from legitimately purchased audio sources, public speech archives, and data gathered in compliance with applicable domestic laws. During collection, we adhered to robots.txt directives and terms of service to ensure proper data acquisition.
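As an illustration of what honoring robots.txt directives can look like in practice, the following is a minimal sketch using Python's standard `urllib.robotparser`. It is not the collection pipeline used for this model; the helper name `is_fetch_allowed`, the user agent string, the example rules, and the URLs are all assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url.

    Hypothetical helper for illustration; a real crawler would also
    download robots.txt per host and check the site's terms of service.
    """
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules: everything is allowed except paths under /private/.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.com/audio/clip.wav",
        "https://example.com/private/clip.wav",
    ):
        allowed = is_fetch_allowed(EXAMPLE_ROBOTS, "example-crawler", url)
        print(url, "allowed" if allowed else "disallowed")
```

The rules are parsed from an in-memory string so the sketch runs without network access.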
Usage
For installation instructions, Docker setup, and detailed usage, please refer to the GitHub repository.
Audio Samples
The samples below demonstrate key capabilities of sarashina2.2-tts:
- Speaking Style Variety: Transfers diverse speaking styles — narration, broadcast, conversation, customer service, and more — from reference audio.
- Zero-shot Voice Cloning: Reproduces a speaker's voice from just a few seconds of reference speech, with no fine-tuning required.
- Cross-lingual Generation: Preserves speaker identity and speaking style consistently across Japanese and English.
- Code Switching: Handles mixed Japanese-English sentences naturally within a single utterance.
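Since zero-shot voice cloning relies on a short reference clip, it can help to check a clip's length before using it. The sketch below measures the duration of a PCM WAV file with Python's standard `wave` module. The function names and the `MIN_SECONDS` / `MAX_SECONDS` bounds are assumptions for illustration only; this page does not state exact length limits, and the actual input requirements are documented in the GitHub repository.

```python
import io
import wave

# Hypothetical bounds for "a few seconds"; not taken from the model's documentation.
MIN_SECONDS = 2.0
MAX_SECONDS = 15.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a PCM WAV file (a path or a binary file object)."""
    with wave.open(wav_file, "rb") as reader:
        return reader.getnframes() / reader.getframerate()


def is_reasonable_reference(wav_file, min_s=MIN_SECONDS, max_s=MAX_SECONDS) -> bool:
    """Check that the clip length falls inside the assumed bounds."""
    return min_s <= wav_duration_seconds(wav_file) <= max_s


def make_silent_wav(seconds: float, sample_rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, used only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        writer.setframerate(sample_rate)
        writer.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    clip = make_silent_wav(3.0)
    print(wav_duration_seconds(clip))
```

In real use, `wav_file` would be the path to a recorded reference clip rather than generated silence.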
Quick Sample
Each sample pairs a reference clip with the audio generated for the text shown.

| Sample | Text |
|---|---|
| Zero-shot Speaker Adaptation | 東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。 |
| Diverse Speaking Styles | お待たせいたしました。お客様のSoftBank光のご契約状況が確認できました。あわせて、Y!mobileとのおうち割 光セットの適用状況をお調べしたいのですが、現在お使いの携帯電話番号をお伺いしてもよろしいでしょうか? |
| English Generation | There is something remarkable about the way language shapes the way we think. A single phrase, spoken in the right tone, can carry emotions that words alone cannot express. |
Speaking Style Variety
Each style has a reference clip and the audio generated for the text shown.

| Style | Text |
|---|---|
| Narration | 午前2時。東京・下町の一角。静まり返った住宅街に、リズミカルに包丁を叩く音が響く。店主の佐藤は、この場所で40年、変わらずにスープを炊き続けてきた。 |
| Broadcast | 国土交通省は15日、過疎地域や山間部における配送ルートの認可プロセスを簡略化する新指針を発表した。これにより、従来は数ヶ月を要していた飛行許可の申請期間が大幅に短縮される見通しだ。 |
| Conversation | なるほど。じゃーちょっとすいませんそのー最近ハマってることについて、もう少しだけお話していただいてもいいですか? |
| Customer Service | お待たせいたしました。ご契約状況を確認したのですが、一点だけ補足で伺わせてください。ご登録いただいているお電話番号の下4桁、もしくはご生年月日を念のためお伺いしてもよろしいでしょうか? |
| Rakugo | 「こんな安月給でやってられっかい、仕事なんかもう辞めたらあ!」て酒場で管巻いとったおっさんがな。翌朝になったら、誰よりもはよ店出て、鼻歌まじりに丁稚使いよる。 |
Zero-shot Voice Generation
Each speaker has a reference clip and the audio generated for the text shown.

| Text | Speakers |
|---|---|
| 東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。 | Speaker A (Male), Speaker B (Female), Speaker C (Female), Speaker D (Senior Female) |
| The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel. | Speaker E (Female), Speaker F (Male) |
Cross-lingual Zero-shot
Each direction has a reference clip and the audio generated for the text shown.

| Direction | Text |
|---|---|
| English Speaker to Japanese Generation | 東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。 |
| Japanese Speaker to English Generation | The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel. |
Code Switching
The sample has a reference clip and the audio generated for the text shown.

| Sample | Text |
|---|---|
| Mixed Japanese-English Sentence | 最新のAI technologies、特にlarge language modelsは、音声合成の分野に大きなRevolutionをもたらしています。 |
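To make the structure of a code-switched input concrete, the sketch below splits the sample sentence into runs of Japanese and Latin script using Unicode ranges. This only illustrates what such an input looks like; it says nothing about how sarashina2.2-tts processes text internally, and the function names and the chosen character ranges are assumptions for the example.

```python
import re
from typing import List, Optional, Tuple

# CJK punctuation, hiragana, katakana, and common CJK ideographs (assumed ranges).
_JAPANESE = re.compile(r"[\u3000-\u30ff\u4e00-\u9fff]")


def _classify(ch: str) -> Optional[str]:
    """Label one character as 'ja', 'en', or None for neutral characters."""
    if _JAPANESE.match(ch):
        return "ja"
    if ch.isascii() and ch.isalpha():
        return "en"
    return None  # spaces, digits, ASCII punctuation, and so on


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into consecutive (label, chunk) runs by script."""
    runs: List[List[str]] = []
    pending = ""  # neutral characters seen before the first labeled run
    for ch in text:
        label = _classify(ch)
        if label is None:
            if runs:
                runs[-1][1] += ch  # neutral characters join the current run
            else:
                pending += ch
        elif runs and runs[-1][0] == label:
            runs[-1][1] += ch
        else:
            runs.append([label, pending + ch])
            pending = ""
    if pending:
        # The text contained no Japanese or Latin letters at all.
        runs.append(["other", pending])
    return [(label, chunk) for label, chunk in runs]


SAMPLE = (
    "最新のAI technologies、特にlarge language modelsは、"
    "音声合成の分野に大きなRevolutionをもたらしています。"
)

if __name__ == "__main__":
    for label, chunk in split_by_script(SAMPLE):
        print(label, chunk)
```

On the sample sentence this yields alternating Japanese and English runs, with "AI technologies", "large language models", and "Revolution" as the English segments.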
Acknowledgments
This model is built upon or incorporates code and models from the following open-source projects:
License
This model is licensed under the Sarashina Model NonCommercial License Agreement.
If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.
The audio samples provided on this page are for research purposes only and may not be redistributed or used for commercial purposes.
Base model: sbintuitions/sarashina2.2-0.5b