Baichuan-Audio-Base

Maintained by: baichuan-inc

Property     Value
Author       baichuan-inc
License      Apache 2.0
Model URL    HuggingFace

What is Baichuan-Audio-Base?

Baichuan-Audio-Base is an end-to-end speech interaction foundation model that integrates audio processing, language understanding, and speech generation. Its architecture has three main components: the Baichuan-Audio Tokenizer, the Audio LLM, and a flow-matching-based Audio Decoder.

Implementation Details

The model operates at a 12.5 Hz frame rate and uses the Whisper Large encoder for feature extraction. An 8-layer residual vector quantization (RVQ) stage keeps information loss during quantization low, and the tokenizer uses both Mel spectrogram reconstruction and pre-trained LLM supervision to retain acoustic and semantic information.
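To make the frame-rate figures concrete, the short sketch below works out how many frames and RVQ codes a clip would produce. It uses only the 12.5 Hz and 8-layer values stated above; the helper name `token_budget` is hypothetical, and truncating a partial final frame is an assumption, since the model's actual padding behavior is not described here.

```python
FRAME_RATE_HZ = 12.5  # frames per second, as stated in the text
RVQ_LAYERS = 8        # codebook indices emitted per frame, as stated in the text


def token_budget(duration_s):
    """Return (frames, codes) for a clip of the given length in seconds.

    Assumption: a partial final frame is dropped rather than padded.
    """
    frames = int(duration_s * FRAME_RATE_HZ)
    return frames, frames * RVQ_LAYERS


frames, codes = token_budget(10.0)
print(f"10 s of audio -> {frames} frames, {codes} RVQ codes")
```

At these settings, one second of audio corresponds to 12.5 frames, or 100 codebook indices across the eight layers.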

  • Tokenizer: Whisper Large encoder followed by 8-layer RVQ
  • Audio LLM: Generates interleaved text and audio tokens
  • Decoder: Flow-matching-based system for high-quality 24 kHz audio output
  • Training Data: 142k hours of ITTS (interleaved text-to-speech) data and 393k hours of INTLV (interleaved audio-text) data
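The idea behind multi-layer RVQ can be sketched in a few lines: each layer quantizes whatever the previous layers failed to capture, so stacking layers refines the approximation. The example below is a toy illustration in pure Python, not the model's tokenizer. The function names, the 4-dimensional vectors, the 16-entry codebooks, and the random codebook values are all assumptions; only the layer count of 8 comes from the text.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next layer sees only what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


rng = random.Random(0)
DIM, LAYERS, ENTRIES = 4, 8, 16  # 8 layers as in the text; DIM and ENTRIES are toy values
# Random codebooks whose scale halves per layer stand in for learned ones.
codebooks = [
    [[rng.gauss(0, 0.5 ** layer) for _ in range(DIM)] for _ in range(ENTRIES)]
    for layer in range(LAYERS)
]

frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)   # one index per layer
approx = rvq_decode(codes, codebooks)  # sum of the selected entries
print(codes)
```

In a trained tokenizer the codebooks are learned rather than random, which is what allows later layers to reliably shrink the reconstruction error.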

Core Capabilities

  • End-to-end speech processing and generation
  • Seamless modality switching between text and audio
  • High-quality Mel spectrogram reconstruction
  • Comprehensive audio understanding and generation
  • Support for both ASR and TTS tasks

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its two-stage training strategy that preserves pre-trained text knowledge while incorporating audio capabilities. It also uses a novel interleaved approach for text and audio token generation.
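As a rough picture of what an interleaved sequence might look like, the sketch below alternates text-token chunks with audio-token chunks and wraps each audio chunk in boundary markers. Everything here is hypothetical: the function name `interleave`, the marker strings `<audio_start>` and `<audio_end>`, and the chunking scheme are illustrative assumptions, not the model's actual vocabulary or layout.

```python
def interleave(text_chunks, audio_chunks,
               audio_start="<audio_start>", audio_end="<audio_end>"):
    """Alternate text and audio chunks in one flat token sequence.

    The marker tokens are hypothetical placeholders for whatever
    modality-switch tokens a real model would use.
    """
    sequence = []
    for text, audio in zip(text_chunks, audio_chunks):
        sequence.extend(text)         # text tokens come first in each pair
        sequence.append(audio_start)  # switch to the audio modality
        sequence.extend(audio)        # audio codec token ids
        sequence.append(audio_end)    # switch back to text
    return sequence


tokens = interleave(
    text_chunks=[["Hello", ","], ["world", "."]],
    audio_chunks=[[101, 102, 103], [104, 105]],
)
print(tokens)
```

A single autoregressive model can then be trained on such a flat sequence, which is one way to let it move between modalities without separate text and audio pipelines.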

Q: What are the recommended use cases?

The model suits speech recognition, text-to-speech conversion, audio understanding tasks, and applications that switch between text and audio modalities. It is a particularly good fit where high-quality 24 kHz audio output is needed.
