Step-Audio-TTS-3B

stepfun-ai

Step-Audio-TTS-3B is a text-to-speech model trained on a large-scale synthetic dataset. It supports multiple languages, emotional expression, and RAP and humming generation, and reports state-of-the-art results on the SEED TTS Eval benchmark.

Property	Value
Model Size	3B parameters
Model Type	Text-to-Speech (TTS)
Architecture	Dual-codebook LLM
Hugging Face	stepfun-ai/Step-Audio-TTS-3B
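The repository ID in the table can be used to fetch the checkpoint files. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the helper name `download_checkpoint` and the default target directory are illustrative choices, not part of the model's own tooling.

```python
REPO_ID = "stepfun-ai/Step-Audio-TTS-3B"  # repository ID from the table above


def download_checkpoint(local_dir: str = "Step-Audio-TTS-3B") -> str:
    """Download every file in the model repository and return the local path.

    `download_checkpoint` and the default `local_dir` are hypothetical names
    chosen for this example.
    """
    # Imported lazily so this module loads even without the dependency installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (downloads several gigabytes of weights):
# path = download_checkpoint()
```

This only retrieves the files. Running inference requires the developers' own code, which is not covered here.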

What is Step-Audio-TTS-3B?

Step-Audio-TTS-3B is described by its developers as the industry's first TTS model trained using the LLM-Chat paradigm on a large-scale synthetic dataset. On the SEED TTS Eval benchmark it reports a Character Error Rate (CER) of 1.53% for Chinese and a Word Error Rate (WER) of 2.71% for English, lower than the figures reported for existing models such as GLM-4-Voice and MinMo.
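For readers unfamiliar with these metrics: in TTS evaluation the synthesized audio is transcribed by a speech recognizer, and the transcript is compared with the input text. The error rate is the edit distance between the two divided by the reference length, counted in characters for CER and in words for WER. The sketch below shows only that arithmetic. It does not reproduce the benchmark's recognizer or text normalization, and the function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, ignoring spaces."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate over whitespace-separated words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, a six-character Chinese reference with one wrong character gives a CER of 1/6, and a six-word English reference with one dropped word gives a WER of 1/6.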

Implementation Details

The model has two main components: an LLM trained with a dual-codebook scheme, which performs the text-to-speech synthesis, and specialized vocoders for standard speech and for humming generation. According to the model card, this design produces high-quality speech while keeping the spoken content consistent with the input text.

Dual-codebook LLM backbone with specialized vocoders
Training on a large-scale synthetic dataset
Optimized for both Chinese and English
State-of-the-art reported CER (Chinese) and WER (English)
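The two-stage layout described in this section can be pictured as a token-generating stage followed by a vocoder chosen per output type. The sketch below is purely illustrative: every name in it (`AudioTokens`, `fake_llm`, `speech_vocoder`, `humming_vocoder`, `synthesize`) is hypothetical, the stages are stubs that return placeholder values, and the token layout is an assumption rather than the model's actual format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AudioTokens:
    """Two token streams, one per codebook (assumed layout, for illustration)."""
    stream_a: List[int]
    stream_b: List[int]


def fake_llm(text: str) -> AudioTokens:
    # Stub for the dual-codebook LLM: derives dummy token ids from the text.
    # The modulus 256 is arbitrary and says nothing about real codebook sizes.
    ids = [ord(ch) % 256 for ch in text]
    return AudioTokens(stream_a=ids, stream_b=ids[::-1])


def speech_vocoder(tokens: AudioTokens) -> str:
    # Stub: a real vocoder would return a waveform, not a description.
    return f"speech waveform from {len(tokens.stream_a)} token pairs"


def humming_vocoder(tokens: AudioTokens) -> str:
    # Stub for the separate vocoder used for humming output.
    return f"humming waveform from {len(tokens.stream_a)} token pairs"


VOCODERS: Dict[str, Callable[[AudioTokens], str]] = {
    "speech": speech_vocoder,
    "humming": humming_vocoder,
}


def synthesize(text: str, mode: str = "speech") -> str:
    """Stage 1 generates tokens; stage 2 decodes them with the chosen vocoder."""
    if mode not in VOCODERS:
        raise ValueError(f"unknown mode: {mode!r}")
    tokens = fake_llm(text)
    return VOCODERS[mode](tokens)
```

The point of the sketch is the separation of concerns: the same token-generating stage feeds either vocoder, so supporting humming does not require a second language model.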

Core Capabilities

Multi-language support, with the strongest reported results in Chinese and English
Control over emotional expression
Voice style customization
RAP and humming generation
Better content consistency (lower CER and WER) than the models it is compared against

Frequently Asked Questions

Q: What makes this model unique?

According to its developers, Step-Audio-TTS-3B is the first TTS model trained using the LLM-Chat paradigm and the first capable of generating RAP and humming. It combines these capabilities with state-of-the-art reported error rates on the SEED TTS Eval benchmark.

Q: What are the recommended use cases?

The model suits applications that need high-quality speech synthesis, including multilingual content creation, emotionally expressive voice generation, RAP production, and humming synthesis. It is a particularly good fit where the spoken output must match the input text closely and sound natural.