ultravox-v0_4

fixie-ai

Multimodal Speech LLM combining Llama3.1-8B-Instruct and Whisper-medium for speech/text processing. Achieves 4.45% WER on LibriSpeech with ~50-100 tokens/sec generation.

Property      Value
Developer     Fixie.ai
License       MIT
Base Models   Llama3.1-8B-Instruct, Whisper-medium
Repository    https://ultravox.ai

What is ultravox-v0_4?

Ultravox-v0_4 is a multimodal speech language model that pairs Llama3.1-8B-Instruct with a Whisper-medium audio encoder to process both speech and text inputs. The model understands spoken language and generates text responses, making it suitable for voice-based applications and speech-analysis tasks.

Implementation Details

The model utilizes a frozen Llama3.1-8B-Instruct backbone and Whisper-medium encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. Training was conducted using BF16 mixed precision on 8x H100 GPUs, employing a knowledge-distillation loss to match the text-based Llama backbone's logits.
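The pseudo-token mechanism described above can be sketched as follows. This is a toy illustration, not Ultravox's actual code: the token ID, embedding width, and array shapes are invented for the example. In the real model, the trained adapter projects Whisper-medium encoder states into Llama's embedding space, and those projected vectors take the place of the single <|audio|> slot in the token sequence.

```python
import numpy as np

HIDDEN = 8          # toy embedding width (Llama3.1-8B actually uses 4096)
AUDIO_TOKEN = -1    # stand-in ID for the <|audio|> pseudo-token

def splice_audio(token_ids, text_embeds, audio_embeds):
    """Replace the single <|audio|> slot in the text embedding
    sequence with the adapter's audio-derived embeddings."""
    pos = token_ids.index(AUDIO_TOKEN)
    return np.concatenate(
        [text_embeds[:pos], audio_embeds, text_embeds[pos + 1:]], axis=0
    )

# Toy inputs: 5 text tokens, one of which is the audio slot, plus
# 3 audio frames already projected into the LLM's embedding space.
token_ids = [10, 11, AUDIO_TOKEN, 12, 13]
text_embeds = np.random.randn(5, HIDDEN)
audio_embeds = np.random.randn(3, HIDDEN)

seq = splice_audio(token_ids, text_embeds, audio_embeds)
print(seq.shape)  # (7, 8): 4 text embeddings + 3 audio embeddings
```

Because the Llama backbone and Whisper encoder stay frozen, only the small projection that produces `audio_embeds` needs gradient updates during training.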

  • Time-to-first-token (TTFT): ~150ms
  • Generation speed: 50-100 tokens/second on A100-40GB GPU
  • Improved WER: 4.45% on LibriSpeech clean test
  • Enhanced translation performance: BLEU scores of 25.47 (en_de) and 37.11 (es_en)

Core Capabilities

  • Speech-to-text processing with high accuracy
  • Multilingual translation support
  • Voice agent functionality
  • Speech analysis and understanding
  • Text and speech input processing

Frequently Asked Questions

Q: What makes this model unique?

Ultravox-v0_4 stands out for its ability to process both speech and text inputs in a unified framework, with strong speech-recognition and translation results (4.45% WER on LibriSpeech clean; BLEU 25.47 en_de and 37.11 es_en). Its architecture, combining a frozen Llama3.1 backbone with a Whisper-medium encoder, enables efficient processing with relatively low latency.

Q: What are the recommended use cases?

The model is ideal for applications requiring voice agent capabilities, speech-to-speech translation, spoken audio analysis, and general conversational AI tasks. It can be easily integrated into applications using standard Python libraries like transformers, peft, and librosa.
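A minimal integration sketch using the transformers pipeline API is below. It assumes the model is published on the Hugging Face Hub under an ID like `fixie-ai/ultravox-v0_4` and ships custom pipeline code (hence `trust_remote_code=True`); the input dict keys (`audio`, `turns`, `sampling_rate`) follow that custom pipeline's conventions and should be checked against the model card before use.

```python
def ask_ultravox(audio_path, system_prompt="You are a helpful assistant.",
                 model_id="fixie-ai/ultravox-v0_4"):
    """Sketch: run one spoken query through the Ultravox pipeline.
    Imports are deferred so the function can be defined without the
    heavy dependencies installed."""
    import librosa        # third-party: audio loading and resampling
    import transformers   # Hugging Face transformers

    # Custom pipeline code lives in the model repo, hence trust_remote_code.
    pipe = transformers.pipeline(model=model_id, trust_remote_code=True)

    # Whisper-family encoders expect 16 kHz mono audio.
    audio, sr = librosa.load(audio_path, sr=16000)

    # Conversation turns; the pipeline handles <|audio|> placement.
    turns = [{"role": "system", "content": system_prompt}]

    return pipe(
        {"audio": audio, "turns": turns, "sampling_rate": sr},
        max_new_tokens=64,
    )
```

Note that loading the model pulls an 8B-parameter checkpoint, so a GPU (the figures above were measured on an A100-40GB) is recommended for interactive latency.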
