Llasa-3B

HKUSTAudio

A 3B-parameter text-to-speech model that extends the LLaMA architecture and was trained on 250K hours of Chinese-English speech data. It supports both direct text-to-speech and speech-prompted synthesis.

Property	Value
Author	HKUSTAudio
Model Size	3 Billion Parameters
License	CC BY-NC 4.0
Training Data	250,000 hours Chinese-English Speech

What is Llasa-3B?

Llasa-3B is a text-to-speech (TTS) system built on the LLaMA language model architecture. It integrates the XCodec2 codebook, which contains 65,536 speech tokens, to synthesize high-quality speech in both Chinese and English. Speech generation is incorporated directly into the LLaMA framework.

Implementation Details

The model combines LLaMA's language understanding with speech token generation. It converts audio into single-codebook tokens, so speech synthesis becomes a language modeling task: the model predicts speech tokens in the same way an LLM predicts text tokens. This makes it compatible with existing LLM optimization techniques, including compression, acceleration, and fine-tuning methods.
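To make the single-codebook idea concrete, here is a minimal sketch of mapping codec indices to vocabulary tokens and back. The codebook size comes from the text above; the `<|s_N|>` string format and both function names are assumptions made for illustration, and the real model's vocabulary may be laid out differently.

```python
import re

CODEBOOK_SIZE = 65_536  # number of speech tokens stated in the text

# Assumed string format for a speech token; the real vocabulary may differ.
_TOKEN_PATTERN = re.compile(r"<\|s_(\d+)\|>")


def ids_to_speech_tokens(speech_ids):
    """Turn codec indices into token strings a language model could emit."""
    tokens = []
    for speech_id in speech_ids:
        if not 0 <= speech_id < CODEBOOK_SIZE:
            raise ValueError(f"speech id {speech_id} is outside the codebook")
        tokens.append(f"<|s_{speech_id}|>")
    return tokens


def speech_tokens_to_ids(tokens):
    """Recover codec indices from generated tokens, skipping non-speech tokens."""
    ids = []
    for token in tokens:
        match = _TOKEN_PATTERN.fullmatch(token)
        if match is None:
            continue  # text or control token, not part of the audio
        ids.append(int(match.group(1)))
    return ids
```

Because each audio frame is one index from one codebook, the speech stream is a flat token sequence like text, which is why ordinary language-model tooling can be applied to it.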

Integrated XCodec2 codebook with 65,536 tokens
Supports both direct text-to-speech and speech-prompted synthesis
Compatible with LLaMA framework optimizations
16 kHz speech output
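The difference between the two synthesis modes can be sketched as a difference in the sequence the model is asked to continue. This is an illustration under stated assumptions, not the model's actual prompt format: the marker strings and the `build_sequence` helper are hypothetical, and the sketch assumes speech prompting works by prefixing the prompt's speech tokens so that the model continues in the same voice.

```python
# Hypothetical marker strings; the real model defines its own special tokens.
TEXT_START = "<text>"
TEXT_END = "</text>"
SPEECH_START = "<speech>"


def build_sequence(target_text, prompt_text=None, prompt_speech_tokens=None):
    """Return the sequence the model would continue with new speech tokens.

    Direct mode: only target_text is given.
    Speech-prompted mode: the prompt transcript is prepended to the text and
    the prompt's speech tokens are placed after the speech marker.
    """
    if (prompt_text is None) != (prompt_speech_tokens is None):
        raise ValueError("prompt_text and prompt_speech_tokens go together")

    text = target_text if prompt_text is None else f"{prompt_text} {target_text}"
    sequence = [TEXT_START, text, TEXT_END, SPEECH_START]
    if prompt_speech_tokens is not None:
        sequence.extend(prompt_speech_tokens)  # voice to be continued
    return sequence


direct = build_sequence("Hello there.")
prompted = build_sequence(
    "Hello there.",
    prompt_text="Good morning.",
    prompt_speech_tokens=["<|s_11|>", "<|s_42|>"],
)
```

In this framing, direct mode starts generation right after the speech marker, while speech-prompted mode starts after the reference audio's tokens.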

Core Capabilities

Direct text-to-speech synthesis
Speech-prompted generation maintaining voice characteristics
Bilingual support (Chinese and English)
Configurable generation parameters (temperature, top-p sampling)
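For readers unfamiliar with the two sampling parameters named above, the following is a generic, dependency-free sketch of temperature scaling followed by top-p (nucleus) sampling over a list of logits. It is not taken from the model's code; the function name `sample_top_p` and its defaults are assumptions chosen for illustration.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=1.0, rng=None):
    """Sample one index from logits using temperature and nucleus filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    rng = rng or random.Random()

    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest set of most likely indices whose mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for index in order:
        nucleus.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Sample within the nucleus, weighted by the original probabilities.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

Lower values of either parameter make the output more deterministic; higher values add variation between runs on the same text.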

Frequently Asked Questions

Q: What makes this model unique?

Llasa-3B treats speech synthesis as a language modeling task, which allows it to integrate with LLM frameworks while maintaining high-quality speech output. Its ability to handle both direct text input and speech prompts sets it apart from traditional TTS systems.

Q: What are the recommended use cases?

The model suits applications that require high-quality bilingual speech synthesis, including voice assistants, content creation, and accessibility tools. Because it is released under the CC BY-NC 4.0 license, its use is restricted to non-commercial applications.