Baichuan-Audio-Instruct
| Property | Value |
|---|---|
| Author | baichuan-inc |
| License | Apache 2.0 |
| Model URL | https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct |
What is Baichuan-Audio-Instruct?
Baichuan-Audio-Instruct is an end-to-end speech interaction foundation model that combines audio processing with language understanding. Its architecture has three components: the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching-based Audio Decoder. Together they let the model switch between text and audio modalities within a single interaction.
Implementation Details
The model operates at a frame rate of 12.5 Hz. It uses the Whisper Large encoder for audio feature extraction and an 8-layer residual vector quantizer (RVQ) to minimize information loss during quantization. Training follows a two-stage strategy intended to preserve the language model's existing capabilities while adding audio capabilities.
- Audio tokenization using Whisper Large Encoder and 8-layer RVQ
- Interleaved text and audio token generation
- Flow-matching based decoder for high-quality 24 kHz audio output
- Two-stage training approach to preserve language model capabilities
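To make the tokenization step more concrete, here is a minimal sketch of residual vector quantization in NumPy. It is not the Baichuan-Audio Tokenizer itself: the codebook size, vector dimension, random codebooks, and the helper names `make_toy_codebooks`, `rvq_encode`, `rvq_decode`, and `codes_per_second` are all assumptions made for illustration. Only the 8 layers and the 12.5 Hz frame rate come from the description above.

```python
import numpy as np

FRAME_RATE_HZ = 12.5   # frame rate stated in the model description
NUM_RVQ_LAYERS = 8     # number of RVQ layers stated in the model description


def make_toy_codebooks(rng, num_layers=NUM_RVQ_LAYERS, codebook_size=16, dim=4):
    """Build random toy codebooks (hypothetical sizes, not the real tokenizer's).

    Later layers use a smaller scale because they quantize smaller residuals.
    """
    return [
        rng.normal(scale=0.5 ** layer, size=(codebook_size, dim))
        for layer in range(num_layers)
    ]


def rvq_encode(x, codebooks):
    """Quantize one feature vector; return one code index per layer and the final residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Pick the codeword closest to what the previous layers left unexplained.
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - codebook[idx]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct the vector as the sum of the selected codewords."""
    return np.sum([cb[i] for cb, i in zip(codebooks, indices)], axis=0)


def codes_per_second(frame_rate_hz=FRAME_RATE_HZ, num_layers=NUM_RVQ_LAYERS):
    """Codes emitted per second, assuming one code per layer per frame."""
    return frame_rate_hz * num_layers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    codebooks = make_toy_codebooks(rng)
    frame = rng.normal(size=4)  # stand-in for one encoder feature frame
    indices, residual = rvq_encode(frame, codebooks)
    print("codes for this frame:", indices)
    print("reconstruction error:", np.linalg.norm(residual))
    print("codes per second:", codes_per_second())
```

Each layer quantizes only the error left by the layers before it, which is why stacking several small codebooks can represent a frame more finely than a single codebook of the same size.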
Core Capabilities
- End-to-end speech interaction and processing
- Seamless switching between text and audio modalities
- High-quality audio generation through flow-matching
- Audio understanding, evaluated with OpenAudioBench
- Training on 142k hours of ITTS (interleaved text-to-speech) data and 393k hours of INTLV (interleaved audio-text) data
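The flow-matching generation mentioned in the list above can be illustrated with a toy sampler. This is a sketch under stated assumptions, not the model's Audio Decoder: a real decoder would presumably use a learned velocity network conditioned on audio tokens, whereas here a hypothetical `oracle_velocity` function that already knows the target stands in for it. The function names and step count are illustrative.

```python
import numpy as np


def oracle_velocity(x, t, target):
    """Velocity along the straight path x_t = (1 - t) * x0 + t * target.

    Written in terms of the current point x, so it can be queried during
    sampling. A trained model would replace this oracle with a network
    (assumption; the real decoder is not reproduced here).
    """
    return (target - x) / (1.0 - t)


def euler_sample(x0, target, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = np.asarray(x0, dtype=np.float64).copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the oracle never divides by zero
        x = x + dt * oracle_velocity(x, t, target)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=5)            # starting sample at t = 0
    target = np.linspace(-1.0, 1.0, 5)    # stand-in for a clean output frame
    result = euler_sample(noise, target)
    print("max abs error:", np.max(np.abs(result - target)))
```

Sampling starts from noise and follows the velocity field toward the data. With a learned field, the number of Euler steps becomes a trade-off between speed and output quality.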
Frequently Asked Questions
Q: What makes this model unique?
What sets the model apart is that it handles text and audio in an interleaved manner, using special tokens to mark transitions between the two modalities. It also uses a two-stage training strategy to preserve the language model's capabilities while adding audio capabilities.
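As a rough illustration of interleaved output, the sketch below splits a mixed token stream into text and audio segments. The marker strings `<audio_start>` and `<audio_end>` and the helper `split_interleaved` are hypothetical; the model's actual special tokens are not given here.

```python
AUDIO_START = "<audio_start>"  # hypothetical marker, not the model's real token
AUDIO_END = "<audio_end>"      # hypothetical marker, not the model's real token


def split_interleaved(tokens):
    """Group a mixed token stream into (modality, tokens) segments.

    Tokens between AUDIO_START and AUDIO_END are treated as audio tokens;
    everything else is treated as text. The markers themselves are dropped.
    """
    segments = []
    modality = "text"
    current = []
    for tok in tokens:
        if tok in (AUDIO_START, AUDIO_END):
            # A marker closes the current segment and switches modality.
            if current:
                segments.append((modality, current))
                current = []
            modality = "audio" if tok == AUDIO_START else "text"
        else:
            current.append(tok)
    if current:
        segments.append((modality, current))
    return segments


if __name__ == "__main__":
    stream = ["Hello", AUDIO_START, "a17", "a402", AUDIO_END, "world"]
    for modality, segment in split_interleaved(stream):
        print(modality, segment)
```

In a pipeline like the one described, the audio segments would be passed to the audio decoder while the text segments are shown or logged directly.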
Q: What are the recommended use cases?
The model is suited to applications that require audio-text interaction, speech synthesis, and audio understanding. It is aimed at scenarios that move naturally between spoken and written communication.