Llasa-1B
Property | Value |
---|---|
Developer | HKUSTAudio |
License | CC BY-NC 4.0 |
Training Data | 250,000 hours Chinese-English speech |
Codebook Size | 65,536 tokens |
What is Llasa-1B?
Llasa-1B is an innovative text-to-speech synthesis model that builds upon the LLaMA language model architecture. It integrates speech capabilities by incorporating XCodec2 codebook tokens, enabling high-quality speech generation in both Chinese and English. The model represents a significant advancement in multilingual speech synthesis technology.
Implementation Details
The model architecture extends the base LLaMA-1B model by incorporating speech tokens from the XCodec2 codebook. It can generate speech either directly from text input or by utilizing speech prompts, making it versatile for various applications. The implementation supports both direct text-to-speech conversion and voice cloning capabilities.
- Built on LLaMA architecture with speech token integration
- Uses XCodec2 codebook with 65,536 unique speech tokens
- Supports 16kHz audio output
- Implements both zero-shot and prompt-based speech synthesis
Core Capabilities
- Direct text-to-speech synthesis in Chinese and English
- Voice cloning through speech prompts
- High-quality speech generation with controllable parameters
- Flexible deployment with adjustable inference settings
Frequently Asked Questions
Q: What makes this model unique?
Llasa-1B uniquely combines LLaMA's language understanding capabilities with XCodec2's speech tokenization, enabling high-quality multilingual speech synthesis with optional voice cloning features. The model's ability to handle both Chinese and English makes it particularly versatile.
Q: What are the recommended use cases?
The model is ideal for applications requiring high-quality text-to-speech conversion in Chinese or English, particularly when voice consistency or cloning is needed. However, commercial use is prohibited under the CC BY-NC 4.0 license.