Baichuan-Audio-Instruct

Property	Value
Author	baichuan-inc
License	Apache 2.0
Model URL	https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct

What is Baichuan-Audio-Instruct?

Baichuan-Audio-Instruct is an end-to-end speech interaction foundation model that combines audio processing with language understanding. Its architecture has three components: the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching-based Audio Decoder. Together they let the model switch between text and audio modalities within a single interaction.

Implementation Details

The audio tokenizer operates at a 12.5 Hz frame rate. It uses the Whisper Large encoder for audio feature extraction and an 8-layer residual vector quantizer (RVQ) to limit information loss during quantization. Training follows a two-stage strategy intended to preserve the language model's existing capabilities while adding audio capabilities.

Audio tokenization using Whisper Large Encoder and 8-layer RVQ
Interleaved text and audio token generation
Flow-matching based decoder for high-quality 24 kHz audio output
Two-stage training approach to preserve language model capabilities
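
To make the tokenizer figures concrete, the following is a minimal sketch of residual vector quantization, not the model's actual implementation. Only the 12.5 Hz frame rate and the 8 RVQ layers come from the description above; the codebook size, the feature dimension, and the random codebooks are hypothetical values chosen to keep the toy small.

```python
import random

FRAME_RATE_HZ = 12.5   # from the model description
NUM_RVQ_LAYERS = 8     # from the model description
CODEBOOK_SIZE = 16     # hypothetical, kept small for the toy example
DIM = 4                # hypothetical feature dimension


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next layer sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def codes_per_second(frame_rate_hz=FRAME_RATE_HZ, layers=NUM_RVQ_LAYERS):
    """Number of discrete codes emitted per second of audio."""
    return frame_rate_hz * layers


rng = random.Random(0)
# Random codebooks with a smaller scale at deeper layers (an assumption);
# a trained tokenizer would learn these entries instead.
codebooks = [
    [[rng.gauss(0, 1) * 0.5 ** layer for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for layer in range(NUM_RVQ_LAYERS)
]
frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)        # one code index per RVQ layer
reconstruction = rvq_decode(codes, codebooks)
```

Under these assumptions each audio frame becomes 8 code indices, so one second of audio corresponds to 12.5 × 8 = 100 codes. With trained codebooks, each additional layer would refine the reconstruction left by the previous ones, which is the sense in which more layers reduce information loss.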

Core Capabilities

End-to-end speech interaction and processing
Seamless switching between text and audio modalities
High-quality audio generation through flow-matching
Audio understanding, evaluated on the OpenAudioBench benchmark
Trained with 142k hours of ITTS data and 393k hours of INTLV data
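
As a rough illustration of what flow-matching-based generation involves at sampling time, the sketch below integrates a velocity field from noise toward a target with fixed-step Euler updates. It is a toy under stated assumptions: the velocity function is the closed-form field of a straight-line path toward a known target and stands in for a trained network, which in a real decoder would be conditioned on audio tokens rather than given the answer. The function names, the 4-dimensional vectors, and the step count are all hypothetical.

```python
import random


def velocity(x, t, target):
    """Velocity of the straight-line path toward target.

    Hypothetical stand-in for a learned velocity network.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


def euler_sample(x0, target, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so velocity() never divides by zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(4)]   # starting point drawn from noise
target = [0.5, -1.0, 0.25, 2.0]               # hypothetical "clean" feature vector
sample = euler_sample(noise, target)
```

In this toy setting the final Euler step lands on the target up to floating-point error. A trained decoder only approximates the velocity field, so output quality depends on the network and on the number of integration steps.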

Frequently Asked Questions

Q: What makes this model unique?

The model handles text and audio in an interleaved manner, using special tokens to mark transitions between the two modalities. It also uses a two-stage training strategy to preserve the language model's capabilities while adding audio capabilities.
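
One way to picture interleaved output is as a flat token stream in which marker tokens switch between text and audio segments. The sketch below splits such a stream into segments. The marker names `<audio_start>` and `<audio_end>` and the integer audio codes are hypothetical placeholders, not the model's actual vocabulary.

```python
AUDIO_START = "<audio_start>"  # hypothetical marker name
AUDIO_END = "<audio_end>"      # hypothetical marker name


def split_interleaved(tokens):
    """Group a flat token stream into ("text" | "audio", [tokens]) segments."""
    segments = []
    mode = "text"
    current = []
    for tok in tokens:
        if tok == AUDIO_START or tok == AUDIO_END:
            # A marker closes the current segment and switches modality.
            if current:
                segments.append((mode, current))
                current = []
            mode = "audio" if tok == AUDIO_START else "text"
        else:
            current.append(tok)
    if current:
        segments.append((mode, current))
    return segments


stream = ["Hello", ",", AUDIO_START, 101, 57, 902, AUDIO_END,
          "world", AUDIO_START, 14, AUDIO_END]
segments = split_interleaved(stream)
# segments == [("text", ["Hello", ","]), ("audio", [101, 57, 902]),
#              ("text", ["world"]), ("audio", [14])]
```

In a pipeline shaped like this, text segments could be shown to the user directly while audio segments are handed to the audio decoder.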

Q: What are the recommended use cases?

The model is suited to applications that involve audio-text interaction, speech synthesis, and audio understanding, particularly where a conversation moves between spoken and written communication.