# Baichuan-Audio-Base
| Property | Value |
|---|---|
| Author | baichuan-inc |
| License | Apache 2.0 |
| Model URL | HuggingFace |
## What is Baichuan-Audio-Base?
Baichuan-Audio-Base is an end-to-end speech interaction foundation model that integrates audio processing, language understanding, and speech generation. Its architecture comprises three main components: the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching based audio decoder.
## Implementation Details
The tokenizer operates at a 12.5 Hz frame rate and uses the Whisper Large encoder for feature extraction. An 8-layer residual vector quantization (RVQ) scheme keeps information loss during quantization minimal, and training is supervised by both Mel spectrogram reconstruction and a pre-trained LLM.
- Tokenizer: Uses Whisper Large Encoder with 8-layer RVQ quantization
- Audio LLM: Generates interleaved text and audio tokens
- Decoder: Flow-matching based system for high-quality 24 kHz audio output
- Training Data: 142k hours of interleaved text-to-speech (ITTS) data and 393k hours of audio-text interleaved (INTLV) data
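To make the 8-layer RVQ concrete, here is a minimal sketch of residual vector quantization with toy random codebooks (the real tokenizer's codebook sizes, dimensions, and learned codebooks are not given in this card): each layer quantizes the residual left over by the previous layers, which is why multiple layers lose less information than a single codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_LAYERS = 8      # 8-layer RVQ, as described above
CODEBOOK_SIZE = 16  # illustrative only; the real codebook size is not given here
DIM = 4             # toy feature dimension

# One random codebook per RVQ layer (stand-ins for learned codebooks).
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(NUM_LAYERS)]

def rvq_encode(x, codebooks):
    """Quantize x layer by layer: each codebook encodes the residual
    left over from the previous layers."""
    residual = x.copy()
    indices = []
    for cb in codebooks:
        # Nearest codeword for the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(idx)
        residual = residual - cb[idx]  # pass the remainder to the next layer
    return indices, residual

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected codewords across layers."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

x = rng.normal(size=DIM)
indices, final_residual = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
# By construction, x_hat + final_residual == x; deeper stacks shrink the
# residual, so x_hat approximates x increasingly well.
```

One code index per layer is emitted for every frame, so each frame of audio becomes a small stack of discrete tokens.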
## Core Capabilities
- End-to-end speech processing and generation
- Seamless modality switching between text and audio
- High-quality Mel spectrogram reconstruction
- Comprehensive audio understanding and generation
- Support for both ASR and TTS tasks
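The modality switching above relies on the Audio LLM emitting interleaved text and audio tokens. The following sketch shows one plausible way such a sequence could be assembled, using hypothetical `<audio_start>`/`<audio_end>` marker tokens and placeholder audio-token ids; the model's actual special tokens and vocabulary may differ.

```python
# Hypothetical modality-switch tokens; the real model's special tokens may differ.
AUDIO_START, AUDIO_END = "<audio_start>", "<audio_end>"

def interleave(segments):
    """Flatten (modality, tokens) segments into one sequence, wrapping audio
    spans in switch tokens so a single LLM can alternate between emitting
    text tokens and discrete audio tokens."""
    seq = []
    for modality, tokens in segments:
        if modality == "audio":
            seq.append(AUDIO_START)
            seq.extend(tokens)
            seq.append(AUDIO_END)
        else:
            seq.extend(tokens)
    return seq

segments = [
    ("text", ["Hello", ","]),
    ("audio", ["a17", "a203", "a44"]),  # placeholder audio-token ids
    ("text", ["world", "."]),
]
sequence = interleave(segments)
```

Training on sequences shaped like this is what lets one decoder handle ASR (audio in, text out), TTS (text in, audio out), and mixed dialogue in a single pass.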
## Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its two-stage training strategy that preserves pre-trained text knowledge while incorporating audio capabilities. It also uses a novel interleaved approach for text and audio token generation.
Q: What are the recommended use cases?
The model is ideal for speech recognition, text-to-speech conversion, audio understanding tasks, and applications requiring seamless switching between text and audio modalities. It's particularly well-suited for applications requiring high-quality audio output at 24 kHz.
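As a back-of-the-envelope check on the numbers in this card: at a 12.5 Hz frame rate with 8 RVQ codebooks, one second of speech becomes 100 discrete tokens (assuming all 8 codebook indices are emitted per frame), and each frame corresponds to 1,920 samples of the 24 kHz decoder output.

```python
FRAME_RATE_HZ = 12.5     # tokenizer frame rate from the model card
RVQ_LAYERS = 8           # codebooks per frame
SAMPLE_RATE_HZ = 24_000  # decoder output sample rate

tokens_per_second = FRAME_RATE_HZ * RVQ_LAYERS      # audio tokens per second
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ  # waveform samples per frame

print(tokens_per_second, samples_per_frame)
```

This low frame rate is what keeps long audio contexts affordable for the LLM while the flow-matching decoder recovers full-bandwidth 24 kHz audio from the coarse tokens.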