sarashina2.2-tts

sarashina2.2-tts is a Japanese-centric text-to-speech system built on a large language model, developed by SB Intuitions. It synthesizes Japanese and English speech with high pronunciation accuracy, naturalness, and stability across diverse speaking styles, and supports zero-shot voice generation.

Highlights

🇯🇵 Japanese-Centric: Designed and optimized specifically for Japanese, with broad coverage of real-world use cases.
🎯 High Accuracy: Delivers strong pronunciation accuracy on Japanese text through large-scale end-to-end training.
🔒 Responsibly Sourced Training Data: Trained exclusively on legitimately acquired and properly licensed speech data.
🎙️ Zero-shot Voice Generation: Reproduces a speaker's voice, speaking style, and acoustic characteristics from a short reference clip.
🔊 Natural & Expressive: Produces highly natural speech with consistent quality, supporting a wide range of speaking styles including narration, broadcast, conversation, and customer service.
🌐 Bilingual: Supports both Japanese and English text-to-speech synthesis.

Training Data

This model was trained on audio from three kinds of sources: legitimately purchased audio, public speech archives, and data gathered in compliance with applicable domestic laws. During collection, we adhered to robots.txt directives and terms of service to ensure proper data acquisition.
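
The collection pipeline itself is not described here. As a minimal sketch of what honoring robots.txt directives can look like in practice, the following uses Python's standard `urllib.robotparser`. The helper name `is_allowed`, the user agent `example-crawler`, the rules, and the URLs are all hypothetical and are not taken from the sarashina2.2-tts project.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in (
        "https://example.com/audio/clip.wav",
        "https://example.com/private/clip.wav",
    ):
        print(url, is_allowed(EXAMPLE_ROBOTS, "example-crawler", url))
```

A real crawler would fetch each site's own robots.txt before requesting anything else from it, and would check the site's terms of service separately, since robots.txt does not encode them.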

Usage

For installation instructions, Docker setup, and detailed usage, please refer to the GitHub repository.

Audio Samples

The samples below demonstrate key capabilities of sarashina2.2-tts:

Speaking Style Variety: Transfers diverse speaking styles from reference audio, including narration, broadcast, conversation, and customer service.
Zero-shot Voice Cloning: Reproduces a speaker's voice from just a few seconds of reference speech, with no fine-tuning required.
Cross-lingual Generation: Preserves speaker identity and speaking style consistently across Japanese and English.
Code Switching: Handles mixed Japanese-English sentences naturally within a single utterance.

Quick Sample

Each sample has a Reference clip and a Generated clip.

Zero-shot Speaker Adaptation: 東京から金沢までは新幹線を利用するのが便利で、所要時間は約２時間半です。

Diverse Speaking Styles: お待たせいたしました。お客様のSoftBank光のご契約状況が確認できました。あわせて、Y!mobileとのおうち割光セットの適用状況をお調べしたいのですが、現在お使いの携帯電話番号をお伺いしてもよろしいでしょうか？

English Generation: There is something remarkable about the way language shapes the way we think. A single phrase, spoken in the right tone, can carry emotions that words alone cannot express.

Speaking Style Variety

Each style has a Reference clip and a Generated clip.

Narration: 午前2時。東京・下町の一角。静まり返った住宅街に、リズミカルに包丁を叩く音が響く。店主の佐藤は、この場所で40年、変わらずにスープを炊き続けてきた。

Broadcast: 国土交通省は15日、過疎地域や山間部における配送ルートの認可プロセスを簡略化する新指針を発表した。これにより、従来は数ヶ月を要していた飛行許可の申請期間が大幅に短縮される見通しだ。

Conversation: なるほど。じゃーちょっとすいませんそのー最近ハマってることについて、もう少しだけお話していただいてもいいですか？

Customer Service: お待たせいたしました。ご契約状況を確認したのですが、一点だけ補足で伺わせてください。ご登録いただいているお電話番号の下4桁、もしくはご生年月日を念のためお伺いしてもよろしいでしょうか？

Rakugo: 「こんな安月給でやってられっかい、仕事なんかもう辞めたらあ！」て酒場で管巻いとったおっさんがな。翌朝になったら、誰よりもはよ店出て、鼻歌まじりに丁稚使いよる。

Zero-shot Voice Generation

Each speaker has a Reference clip and a Generated clip for the text shown.

Japanese text: 東京から金沢までは新幹線を利用するのが便利で、所要時間は約２時間半です。
Speaker A (Male)
Speaker B (Female)
Speaker C (Female)
Speaker D (Senior Female)

English text: The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel.
Speaker E (Female)
Speaker F (Male)

Cross-lingual Zero-shot

Each sample has a Reference clip and a Generated clip.

English Speaker to Japanese Generation: 東京から金沢までは新幹線を利用するのが便利で、所要時間は約２時間半です。

Japanese Speaker to English Generation: The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel.

Code Switching

This sample has a Reference clip and a Generated clip.

Mixed Japanese-English Sentence: 最新のAI technologies、特にlarge language modelsは、音声合成の分野に大きなRevolutionをもたらしています。
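
To make concrete what a code-switched input looks like, here is a small sketch that splits a sentence into Latin-script runs and everything else. It only illustrates the structure of the input; it is not how sarashina2.2-tts processes text, and the helper name `split_by_script` and the `"en"`/`"ja"` labels are hypothetical.

```python
import re

# A run of ASCII letters, optionally continued by single spaces and more letters,
# is treated as English. Everything between such runs (kana, kanji, punctuation)
# is labeled "ja" for simplicity.
_LATIN_RUN = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into (label, segment) pairs, preserving the original order."""
    segments = []
    pos = 0
    for match in _LATIN_RUN.finditer(text):
        if match.start() > pos:
            segments.append(("ja", text[pos:match.start()]))
        segments.append(("en", match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append(("ja", text[pos:]))
    return segments


if __name__ == "__main__":
    sentence = "最新のAI technologies、特にlarge language modelsは"
    for label, segment in split_by_script(sentence):
        print(label, segment)
```

The sample sentence switches languages several times within a single utterance, which is the case this demo is meant to show.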