Llasa-1B

HKUSTAudio

Llasa-1B is a text-to-speech model that extends the LLaMA language model with 65,536 speech tokens from the XCodec2 codebook. It was trained on 250,000 hours of Chinese-English speech data.

Property	Value
Developer	HKUSTAudio
License	CC BY-NC 4.0
Training Data	250,000 hours Chinese-English speech
Codebook Size	65,536 tokens

What is Llasa-1B?

Llasa-1B is a text-to-speech synthesis model built on the LLaMA language model architecture. Speech is represented as discrete tokens drawn from the XCodec2 codebook and added to the model's vocabulary, so the same language model that reads the input text also generates the speech tokens. The model synthesizes speech in both Chinese and English.

Implementation Details

The model extends the 1B-parameter LLaMA base model by adding the speech tokens of the XCodec2 codebook to its vocabulary. It can generate speech in two ways: from input text alone (direct text-to-speech), or from input text together with a speech prompt, in which case the output continues in the voice of the prompt (voice cloning).

Built on LLaMA architecture with speech token integration
Uses XCodec2 codebook with 65,536 unique speech tokens
Supports 16kHz audio output
Supports both text-only synthesis and synthesis conditioned on a speech prompt
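
The speech token integration listed above can be illustrated with a minimal sketch. It assumes each XCodec2 codebook index is written as a token string of the form `<|s_N|>`; that spelling, and the helper names `ids_to_speech_tokens`, `speech_tokens_to_ids`, and `extended_vocab_size`, are illustrative assumptions rather than a confirmed part of the model's tokenizer. Only the codebook size of 65,536 comes from the description above.

```python
import re

CODEBOOK_SIZE = 65_536  # from the model description; equals 2**16

# Assumed token spelling; the real tokenizer may use a different pattern.
_SPEECH_TOKEN = re.compile(r"<\|s_(\d+)\|>")


def ids_to_speech_tokens(ids):
    """Map codebook indices to token strings, rejecting out-of-range ids."""
    tokens = []
    for i in ids:
        if not 0 <= i < CODEBOOK_SIZE:
            raise ValueError(f"codebook index out of range: {i}")
        tokens.append(f"<|s_{i}|>")
    return tokens


def speech_tokens_to_ids(tokens):
    """Recover codebook indices, skipping anything that is not a speech token."""
    ids = []
    for tok in tokens:
        match = _SPEECH_TOKEN.fullmatch(tok)
        if match and int(match.group(1)) < CODEBOOK_SIZE:
            ids.append(int(match.group(1)))
    return ids


def extended_vocab_size(base_vocab_size, n_special=0):
    """Vocabulary size after appending the speech tokens and any extra markers."""
    return base_vocab_size + CODEBOOK_SIZE + n_special
```

In this sketch, `speech_tokens_to_ids` is the step that would run on the model's generated tokens before the indices are handed to the codec decoder to produce 16kHz audio. Text tokens that slip into the output are dropped rather than treated as errors.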

Core Capabilities

Direct text-to-speech synthesis in Chinese and English
Voice cloning through speech prompts
Speech generation with controllable parameters (for example, sampling settings chosen at inference time)
Adjustable inference settings to suit different deployment constraints
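
The difference between direct synthesis and voice cloning can be sketched as a difference in how the model input is assembled. The example below is a hedged illustration, not the model's documented interface: the marker strings, the instruction wording, the `<|s_N|>` token spelling, the `build_request` helper, and the values in `SAMPLING` are all assumptions chosen for illustration.

```python
# Assumed marker strings; the real tokenizer may define different ones.
TEXT_START = "<|TEXT_UNDERSTANDING_START|>"
TEXT_END = "<|TEXT_UNDERSTANDING_END|>"
SPEECH_START = "<|SPEECH_GENERATION_START|>"

# Illustrative sampling settings, not recommended values.
SAMPLING = {"do_sample": True, "temperature": 0.8, "top_p": 1.0}


def build_request(text, prompt_text=None, prompt_speech_ids=None):
    """Assemble the user turn and the assistant prefix the model continues.

    Text-only mode: the assistant prefix is just the speech start marker.
    Voice cloning: the prompt transcript is prepended to the target text and
    the prompt's speech tokens are placed in the prefix, so the model
    continues generating in the same voice.
    """
    if (prompt_text is None) != (prompt_speech_ids is None):
        raise ValueError("prompt_text and prompt_speech_ids must be given together")

    # A space separator is a choice made here; it may not suit Chinese text.
    full_text = text if prompt_text is None else f"{prompt_text} {text}"

    prefix = SPEECH_START
    if prompt_speech_ids is not None:
        prefix += "".join(f"<|s_{i}|>" for i in prompt_speech_ids)

    return {
        "user": f"Convert the text to speech:{TEXT_START}{full_text}{TEXT_END}",
        "assistant_prefix": prefix,
    }
```

Calling `build_request("Hello.")` gives the text-only case, while passing `prompt_text` and `prompt_speech_ids` as well gives the voice cloning case. In both cases the generated speech tokens would still need to be decoded to audio by the codec.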

Frequently Asked Questions

Q: What makes this model unique?

Llasa-1B combines a LLaMA language model with XCodec2 speech tokenization: speech is generated as a sequence of discrete tokens from a 65,536-entry codebook, using the same model that processes the input text. It supports both Chinese and English, and it can optionally clone a voice from a speech prompt.

Q: What are the recommended use cases?

The model is suited to applications that need text-to-speech conversion in Chinese or English, particularly when a consistent or cloned voice is required. Note that the CC BY-NC 4.0 license does not permit commercial use.