Ultravox-v0_4
Property | Value |
---|---|
Developer | Fixie.ai |
License | MIT |
Base Models | Llama3.1-8B-Instruct, Whisper-medium |
Repository | https://ultravox.ai |
What is ultravox-v0_4?
Ultravox-v0_4 is a cutting-edge multimodal Speech Language Model that combines the power of Llama3.1-8B-Instruct and Whisper-medium to process both speech and text inputs. This innovative model can understand spoken language and generate appropriate text responses, making it ideal for voice-based applications and speech analysis tasks.
Implementation Details
The model utilizes a frozen Llama3.1-8B-Instruct backbone and Whisper-medium encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. Training was conducted using BF16 mixed precision on 8x H100 GPUs, employing a knowledge-distillation loss to match the text-based Llama backbone's logits.
- Time-to-first-token (TTFT): ~150ms
- Generation speed: 50-100 tokens/second on A100-40GB GPU
- Improved WER: 4.45% on LibriSpeech clean test
- Enhanced translation performance: BLEU scores of 25.47 (en_de) and 37.11 (es_en)
Core Capabilities
- Speech-to-text processing with high accuracy
- Multilingual translation support
- Voice agent functionality
- Speech analysis and understanding
- Text and speech input processing
Frequently Asked Questions
Q: What makes this model unique?
Ultravox-v0_4 stands out for its ability to process both speech and text inputs in a unified framework, while achieving state-of-the-art performance in speech recognition and translation tasks. The model's architecture, combining Llama3.1 and Whisper-medium, enables efficient processing with relatively low latency.
Q: What are the recommended use cases?
The model is ideal for applications requiring voice agent capabilities, speech-to-speech translation, spoken audio analysis, and general conversational AI tasks. It can be easily integrated into applications using standard Python libraries like transformers, peft, and librosa.