# Baichuan-Audio-Base
| Property | Value |
|---|---|
| Author | baichuan-inc |
| License | Apache 2.0 |
| Model URL | HuggingFace |
## What is Baichuan-Audio-Base?
Baichuan-Audio-Base is an end-to-end speech interaction foundation model that integrates audio processing, language understanding, and speech generation. Its architecture comprises three main components: the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching based audio decoder.
## Implementation Details
The tokenizer operates at a 12.5 Hz frame rate and uses the Whisper Large encoder for feature extraction. An 8-layer residual vector quantization (RVQ) scheme keeps information loss during quantization minimal, and training is supervised by both Mel spectrogram reconstruction and a pre-trained LLM.
- Tokenizer: Uses Whisper Large Encoder with 8-layer RVQ quantization
- Audio LLM: Generates interleaved text and audio tokens
- Decoder: Flow-matching based system for high-quality 24 kHz audio output
- Training Data: 142k hours of interleaved text-to-speech (ITTS) data and 393k hours of audio-text interleaved (INTLV) data
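To make the 8-layer RVQ concrete, here is a minimal sketch of residual vector quantization with toy random codebooks (the real tokenizer's codebook sizes, dimensions, and learned codebooks are not given in this card): each layer quantizes the residual left over by the previous layers, which is why multiple layers lose less information than a single codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 8      # 8-layer RVQ, as described above
CODEBOOK_SIZE = 16  # illustrative only; the real codebook size is not given here
DIM = 4             # toy feature dimension

# One random codebook per RVQ layer (stand-ins for learned codebooks).
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(NUM_LAYERS)]

def rvq_encode(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual
    left over from the previous layers."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Nearest codeword for the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next layer
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codewords across layers."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

x = rng.normal(size=DIM)
indices, final_residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
# By construction, x_hat + final_residual == x; deeper stacks shrink the
# residual, so x_hat approximates x increasingly well.
```

One code index per layer is emitted for every frame, so each frame of audio becomes a small stack of discrete tokens.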
## Core Capabilities
- End-to-end speech processing and generation
- Seamless modality switching between text and audio
- High-quality Mel spectrogram reconstruction
- Comprehensive audio understanding and generation
- Support for both ASR and TTS tasks
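The modality switching above relies on the Audio LLM emitting interleaved text and audio tokens. The following sketch shows one plausible way such a sequence could be assembled, using hypothetical `<audio_start>`/`<audio_end>` marker tokens and placeholder audio-token ids; the model's actual special tokens and vocabulary may differ.

```python
# Hypothetical modality-switch tokens; the real model's special tokens may differ.
AUDIO_START, AUDIO_END = "<audio_start>", "<audio_end>"

def interleave(segments):
    """Flatten (modality, tokens) segments into one sequence, wrapping audio
    spans in switch tokens so a single LLM can alternate between emitting
    text tokens and discrete audio tokens."""
    seq = []
    for modality, tokens in segments:
        if modality == "audio":
            seq.append(AUDIO_START)
            seq.extend(tokens)
            seq.append(AUDIO_END)
        else:
            seq.extend(tokens)
    return seq

segments = [
    ("text", ["Hello", ","]),
    ("audio", ["a17", "a203", "a44"]),  # placeholder audio-token ids
    ("text", ["world", "."]),
]
sequence = interleave(segments)
```

Training on sequences shaped like this is what lets one decoder handle ASR (audio in, text out), TTS (text in, audio out), and mixed dialogue in a single pass.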
## Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its two-stage training strategy that preserves pre-trained text knowledge while incorporating audio capabilities. It also uses a novel interleaved approach for text and audio token generation.
Q: What are the recommended use cases?
The model is ideal for speech recognition, text-to-speech conversion, audio understanding tasks, and applications requiring seamless switching between text and audio modalities. It's particularly well-suited for applications requiring high-quality audio output at 24 kHz.
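As a back-of-the-envelope check on the numbers in this card: at a 12.5 Hz frame rate with 8 RVQ codebooks, one second of speech becomes 100 discrete tokens (assuming all 8 codebook indices are emitted per frame), and each frame corresponds to 1,920 samples of the 24 kHz decoder output.

```python
FRAME_RATE_HZ = 12.5     # tokenizer frame rate from the model card
RVQ_LAYERS = 8           # codebooks per frame
SAMPLE_RATE_HZ = 24_000  # decoder output sample rate

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS      # audio tokens per second
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # waveform samples per frame

print(tokens_per_second, samples_per_frame)
```

This low frame rate is what keeps long audio contexts affordable for the LLM while the flow-matching decoder recovers full-bandwidth 24 kHz audio from the coarse tokens.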