Baichuan-Audio-Instruct

Maintained By
baichuan-inc

Property    Value
Author      baichuan-inc
License     Apache 2.0
Model URL   https://huggingface.co/baichuan-inc/Baichuan-Audio-Instruct

What is Baichuan-Audio-Instruct?

Baichuan-Audio-Instruct is an end-to-end speech interaction foundation model that combines audio processing with language understanding. Its architecture has three components: the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching based Audio Decoder, which together let the model switch between text and audio modalities.

Implementation Details

The model operates at a 12.5 Hz frame rate. It uses the Whisper Large encoder for audio feature extraction and an 8-layer residual vector quantizer (RVQ) to limit information loss during quantization. Training follows a two-stage strategy intended to preserve the language model's existing capabilities while adding audio capabilities.
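To make these figures concrete, the short sketch below works out what a 12.5 Hz frame rate and 8 RVQ layers imply for a clip of a given length. It is plain arithmetic on the numbers stated in this section. The function name `audio_budget` is made up for illustration, and the samples-per-frame value assumes each frame maps to a fixed span of the 24 kHz output, which the source does not state.

```python
FRAME_RATE_HZ = 12.5          # tokenizer frame rate stated in this section
NUM_RVQ_LAYERS = 8            # RVQ depth stated in this section
OUTPUT_SAMPLE_RATE = 24_000   # decoder output rate (24 kHz) stated in this section


def audio_budget(duration_s: float) -> dict:
    """Return frame and code counts for a clip of the given length (hypothetical helper)."""
    frames = int(duration_s * FRAME_RATE_HZ)  # count whole frames only
    return {
        "frame_ms": 1000 / FRAME_RATE_HZ,      # duration covered by one frame
        "frames": frames,                      # frames in the clip
        "codes": frames * NUM_RVQ_LAYERS,      # one code per RVQ layer per frame
        # Assumption: one frame corresponds to a fixed span of output samples.
        "samples_per_frame": OUTPUT_SAMPLE_RATE / FRAME_RATE_HZ,
    }


budget = audio_budget(10.0)
print(budget)
```

Under these assumptions, each frame covers 80 ms, so a 10-second clip yields 125 frames and 1,000 RVQ codes.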

  • Audio tokenization using Whisper Large Encoder and 8-layer RVQ
  • Interleaved text and audio token generation
  • Flow-matching based decoder for high-quality 24 kHz audio output
  • Two-stage training approach to preserve language model capabilities
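As an illustration of the residual vector quantization idea behind the 8-layer RVQ, here is a minimal pure-Python sketch. It is not the Baichuan-Audio Tokenizer: the codebooks are random toy values rather than learned ones, the dimensions are arbitrary, and all function names are hypothetical. It only shows the general mechanism, in which each layer quantizes the residual left by the previous layers.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


rng = random.Random(0)
DIM, ENTRIES, LAYERS = 4, 16, 8  # toy sizes; only LAYERS = 8 comes from the text

# Toy codebooks (assumption): random entries whose scale halves per layer,
# plus an all-zero entry so a layer can never increase the error.
codebooks = [
    [[0.0] * DIM]
    + [[rng.gauss(0, 0.5 ** layer) for _ in range(DIM)] for _ in range(ENTRIES - 1)]
    for layer in range(LAYERS)
]

frame = [rng.gauss(0, 1) for _ in range(DIM)]
codes = rvq_encode(frame, codebooks)

# Reconstruction error when decoding with only the first k layers.
errors = [
    sq_error(frame, rvq_decode(codes[:k], codebooks[:k]))
    for k in range(1, LAYERS + 1)
]
print(codes)
print(errors)
```

In this toy setup the reconstruction error never grows as layers are added, which is the intuition behind using several RVQ layers to reduce quantization loss.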

Core Capabilities

  • End-to-end speech interaction and processing
  • Seamless switching between text and audio modalities
  • High-quality audio generation through flow-matching
  • Audio understanding, evaluated on OpenAudioBench
  • Trained on 142k hours of ITTS data and 393k hours of INTLV data
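For readers unfamiliar with flow matching, the sketch below shows the general sampling idea in its simplest form: start from noise and integrate a velocity field from t = 0 to t = 1. It is a toy under stated assumptions, not the model's Audio Decoder. A real decoder would predict the velocity with a trained network conditioned on audio tokens, whereas here the target is known in advance so the field has a closed form. The names `velocity` and `sample` are hypothetical.

```python
import random


def velocity(x, t, target):
    """Toy velocity field for the straight path x_t = (1 - t) * x0 + t * target.

    Assumption: the target is known, so the field is (target - x) / (1 - t).
    A trained flow-matching decoder would predict this with a neural network.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample(x0, target, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in velocity() is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(4)]   # starting point drawn from noise
target = [0.25, -0.5, 1.0, 0.0]               # stand-in for a clean output frame
result = sample(noise, target, steps=10)
print(result)
```

Because the toy field points straight at the target, the Euler integration ends at the target up to floating-point rounding. With a learned field, the number of steps trades speed against output quality.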

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its ability to handle both text and audio modalities in an interleaved manner, using special tokens for seamless transitions. It also employs a novel two-stage training strategy to preserve language model intelligence while adding audio capabilities.
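The interleaving idea can be sketched as a single token stream in which special markers delimit audio spans. The marker strings `<audio_start>` and `<audio_end>`, the token values, and both function names below are hypothetical placeholders chosen for illustration. The source does not name the model's actual special tokens.

```python
AUDIO_START = "<audio_start>"  # hypothetical marker, not the model's real token
AUDIO_END = "<audio_end>"      # hypothetical marker, not the model's real token


def interleave(segments):
    """Flatten (modality, tokens) segments into one stream with audio markers."""
    stream = []
    for modality, tokens in segments:
        if modality == "audio":
            stream.append(AUDIO_START)
            stream.extend(tokens)
            stream.append(AUDIO_END)
        elif modality == "text":
            stream.extend(tokens)
        else:
            raise ValueError(f"unknown modality: {modality!r}")
    return stream


def split_modalities(stream):
    """Recover (modality, tokens) segments from a marked stream."""
    segments, current, modality = [], [], "text"
    for tok in stream:
        if tok == AUDIO_START:
            if current:  # close any pending text segment
                segments.append((modality, current))
            current, modality = [], "audio"
        elif tok == AUDIO_END:
            segments.append((modality, current))
            current, modality = [], "text"
        else:
            current.append(tok)
    if current:
        segments.append((modality, current))
    return segments


segments = [
    ("text", ["Hello", ","]),
    ("audio", [17, 402, 9]),   # made-up audio token ids
    ("text", ["world"]),
]
stream = interleave(segments)
print(stream)
print(split_modalities(stream))
```

Splitting the stream on the markers recovers the original segments, which is the property that lets a consumer route text tokens to a text output and audio tokens to an audio decoder.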

Q: What are the recommended use cases?

The model is well-suited for applications requiring audio-text interaction, speech synthesis, and comprehensive audio understanding tasks. It excels in scenarios requiring natural transitions between spoken and written communication.
