sarashina2.2-tts

GitHub | Demo | Paper

sarashina2.2-tts is a Japanese-centric text-to-speech system built on a large language model, developed by SB Intuitions. It supports Japanese and English and offers zero-shot voice generation, with high pronunciation accuracy, naturalness, and stability across diverse speaking styles.

Highlights

  • 🇯🇵 Japanese-Centric: Designed and optimized specifically for Japanese, with broad coverage of real-world use cases.
  • 🎯 High Accuracy: Delivers strong pronunciation accuracy on Japanese text through large-scale end-to-end training.
  • 🔒 Responsibly Sourced Training Data: Trained exclusively on legitimately acquired and properly licensed speech data.
  • 🎙️ Zero-shot Voice Generation: Reproduces a speaker's voice, speaking style, and acoustic characteristics from a short reference clip.
  • 🔊 Natural & Expressive: Produces highly natural speech with consistent quality, supporting a wide range of speaking styles including narration, broadcast, conversation, and customer service.
  • 🌐 Bilingual: Supports both Japanese and English text-to-speech synthesis.

Training Data

This model was trained on audio data collected from legitimately purchased audio sources, public speech archives, and data gathered in compliance with applicable domestic laws. During collection, we adhered to robots.txt directives and terms of service to ensure proper data acquisition.

Usage

For installation instructions, Docker setup, and detailed usage, please refer to the GitHub repository.
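As a convenience, the model files can presumably be fetched with the `huggingface_hub` client before following the repository's setup steps. The sketch below is not taken from the GitHub repository: it assumes `huggingface_hub` is installed and that the model repository can be downloaded without additional steps. The function name `download_weights` and the default directory are hypothetical.

```python
from pathlib import Path

# Repository ID as shown on the model page.
REPO_ID = "sbintuitions/sarashina2.2-tts"


def download_weights(local_dir: str = "sarashina2.2-tts") -> Path:
    """Download a snapshot of the model repository and return its local path.

    Assumption: `pip install huggingface_hub` has been run and the repository
    is accessible from this machine. This only fetches files; loading and
    inference are covered by the GitHub repository.
    """
    # Imported lazily so the module can be imported without the dependency.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
```

Calling `download_weights()` needs network access. Use of the files remains subject to the license terms stated on this page.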

Audio Samples

The samples below demonstrate key capabilities of sarashina2.2-tts:

  • Speaking Style Variety: Transfers diverse speaking styles (narration, broadcast, conversation, customer service, and more) from reference audio.
  • Zero-shot Voice Cloning: Reproduces a speaker's voice from just a few seconds of reference speech, with no fine-tuning required.
  • Cross-lingual Generation: Preserves speaker identity and speaking style consistently across Japanese and English.
  • Code Switching: Handles mixed Japanese-English sentences naturally within a single utterance.
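Because zero-shot cloning depends on a short reference clip, it can help to check a clip's length before using it. The following is a minimal sketch using only the Python standard library. It handles PCM WAV files only, and the bounds `min_s` and `max_s` are hypothetical defaults: the model card says only "a few seconds" and gives no exact limits.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path: str, min_s: float = 3.0, max_s: float = 30.0) -> bool:
    """Check a clip against duration bounds.

    The bounds are illustrative assumptions, not values from the model card.
    """
    return min_s <= wav_duration_seconds(path) <= max_s
```

For example, `is_usable_reference("speaker_a.wav")` (a hypothetical file name) returns `True` for a 5-second clip under these default bounds.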

Quick Sample

Zero-shot Speaker Adaptation
東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。
Reference | Generated
Diverse Speaking Styles
お待たせいたしました。お客様のSoftBank光のご契約状況が確認できました。あわせて、Y!mobileとのおうち割 光セットの適用状況をお調べしたいのですが、現在お使いの携帯電話番号をお伺いしてもよろしいでしょうか?
Reference | Generated
English Generation
There is something remarkable about the way language shapes the way we think. A single phrase, spoken in the right tone, can carry emotions that words alone cannot express.
Reference | Generated

Speaking Style Variety

Narration
午前2時。東京・下町の一角。静まり返った住宅街に、リズミカルに包丁を叩く音が響く。店主の佐藤は、この場所で40年、変わらずにスープを炊き続けてきた。
Reference | Generated
Broadcast
国土交通省は15日、過疎地域や山間部における配送ルートの認可プロセスを簡略化する新指針を発表した。これにより、従来は数ヶ月を要していた飛行許可の申請期間が大幅に短縮される見通しだ。
Reference | Generated
Conversation
なるほど。じゃーちょっとすいませんそのー最近ハマってることについて、もう少しだけお話していただいてもいいですか?
Reference | Generated
Customer Service
お待たせいたしました。ご契約状況を確認したのですが、一点だけ補足で伺わせてください。ご登録いただいているお電話番号の下4桁、もしくはご生年月日を念のためお伺いしてもよろしいでしょうか?
Reference | Generated
Rakugo
「こんな安月給でやってられっかい、仕事なんかもう辞めたらあ!」て酒場で管巻いとったおっさんがな。翌朝になったら、誰よりもはよ店出て、鼻歌まじりに丁稚使いよる。
Reference | Generated

Zero-shot Voice Generation

東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。
Speaker | Reference | Generated
Speaker A (Male)
Speaker B (Female)
Speaker C (Female)
Speaker D (Senior Female)
The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel.
Speaker | Reference | Generated
Speaker E (Female)
Speaker F (Male)

Cross-lingual Zero-shot

English Speaker to Japanese Generation
東京から金沢までは新幹線を利用するのが便利で、所要時間は約2時間半です。
Reference | Generated
Japanese Speaker to English Generation
The bullet train from Tokyo to Kanazawa takes approximately two and a half hours, making it the most convenient option for travel.
Reference | Generated

Code Switching

Mixed Japanese-English Sentence
最新のAI technologies、特にlarge language modelsは、音声合成の分野に大きなRevolutionをもたらしています。
Reference | Generated
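To see where the language switches occur in a mixed sentence like the one above, the text can be split into Latin-script runs and everything else. This is only an inspection aid for input text under a simple assumption (runs of ASCII letters are English). It is not a description of how sarashina2.2-tts processes code-switched input, and the helper name `split_by_script` is hypothetical.

```python
import re
from typing import List, Tuple

# Runs of ASCII letters, optionally joined by a space, apostrophe, or hyphen,
# are treated as English. Everything else is labeled "other".
_LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ '\-][A-Za-z]+)*")


def split_by_script(text: str) -> List[Tuple[str, str]]:
    """Split text into ("en", run) and ("other", run) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in _LATIN_RUN.finditer(text):
        if match.start() > pos:
            segments.append(("other", text[pos:match.start()]))
        segments.append(("en", match.group()))
        pos = match.end()
    if pos < len(text):
        segments.append(("other", text[pos:]))
    return segments
```

Applied to the sample sentence, the English segments are "AI technologies", "large language models", and "Revolution", and concatenating all segments reproduces the original text.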

Acknowledgments

This model is built upon or incorporates code and models from the following open-source projects:

License

This model is licensed under the Sarashina Model NonCommercial License Agreement.

If you are interested in using this model for commercial purposes, please feel free to contact us through our contact page.

The audio samples provided on this page are for research purposes only and may not be redistributed or used for commercial purposes.

Model size: 0.8B params (Safetensors, tensor type F32)
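As a rough back-of-the-envelope check, the listed size (0.8B parameters stored as F32) implies a weights footprint of about 3.2 GB. The sketch below does only that arithmetic. The parameter count is the rounded figure from this page, the half-precision entries are hypothetical comparisons (the page lists F32 only), and activations and runtime overhead are not included.

```python
# "0.8B params" as listed on the model page; a rounded figure.
PARAMS = 0.8e9

# Bytes per parameter for common tensor types. Only F32 is listed for this
# model; F16 and BF16 are included purely for comparison.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weights_size_gb(params: float, dtype: str = "F32") -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


print(f"F32 weights: ~{weights_size_gb(PARAMS):.1f} GB")
```

With the listed figures this prints `F32 weights: ~3.2 GB`.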