ultravox-v0_3

fixie-ai

Ultravox v0.3 is an 8.06B-parameter multimodal speech LLM that combines Llama3.1-8B-Instruct and Whisper-small to process speech and text. It is released under the MIT license.

Property	Value
Parameter Count	8.06B
Model Type	Multimodal Speech LLM
License	MIT
Tensor Type	BF16
Repository	https://ultravox.ai
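
As a rough illustration of what the parameter count and tensor type imply, the weights alone can be sized with simple arithmetic. This is a back-of-the-envelope sketch, not a measured figure: it assumes every one of the listed 8.06B parameters is stored in BF16 (2 bytes each) and ignores activations, the KV cache, and framework overhead. The function name is hypothetical.

```python
def weight_memory_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Bytes needed to hold the weights alone (BF16 uses 2 bytes per parameter)."""
    return n_params * bytes_per_param


# 8.06B parameters, as listed in the table above.
weights = weight_memory_bytes(8_060_000_000)
print(f"{weights / 1e9:.2f} GB")    # decimal gigabytes
print(f"{weights / 2**30:.2f} GiB")  # binary gibibytes
```

Under these assumptions the weights come to about 16.12 GB (about 15.01 GiB), which leaves headroom on a 40 GB accelerator for activations and cache.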

What is ultravox-v0_3?

Ultravox v0.3 is a multimodal speech language model built from Llama3.1-8B-Instruct and the Whisper-small encoder. It accepts both speech and text as input, which makes it suited to voice-based applications and natural language processing tasks.

Implementation Details

The model uses a frozen Llama3.1-8B-Instruct backbone and a frozen Whisper-small encoder; only the multimodal adapter is trained. Input text contains a special <|audio|> pseudo-token, which is replaced with embeddings derived from the input audio. Training used BF16 mixed precision on 8x H100 GPUs. At inference on an A100-40GB GPU, the reported figures are a 200ms time-to-first-token and 50-100 tokens per second.
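
The <|audio|> substitution can be pictured with a small sketch. This is not the model's actual code: `toy_text_embed` is a hypothetical stand-in for the LLM's embedding table, `audio_embeds` stands in for the adapter's output, and the sketch assumes exactly one placeholder per prompt. It only shows the idea of splicing a run of audio-derived vectors into the text embedding sequence.

```python
import numpy as np

AUDIO_TOKEN = "<|audio|>"


def toy_text_embed(token: str, dim: int = 4) -> np.ndarray:
    # Hypothetical stand-in for the LLM embedding lookup: a constant vector per token.
    return np.full(dim, float(len(token)))


def splice_audio_embeddings(tokens, text_embed, audio_embeds):
    """Build the input embedding sequence, replacing the single AUDIO_TOKEN
    with the rows of audio_embeds (shape: n_audio_frames x dim)."""
    if tokens.count(AUDIO_TOKEN) != 1:
        raise ValueError("expected exactly one audio placeholder")
    rows = []
    for tok in tokens:
        if tok == AUDIO_TOKEN:
            rows.extend(audio_embeds)  # one row per audio frame
        else:
            rows.append(text_embed(tok))
    return np.stack(rows)


tokens = ["Transcribe", ":", AUDIO_TOKEN, "please"]
audio_embeds = np.zeros((3, 4))  # pretend the adapter produced 3 frames
inputs = splice_audio_embeddings(tokens, toy_text_embed, audio_embeds)
print(inputs.shape)
```

The one placeholder token expands into as many positions as there are audio frames, so the sequence the LLM sees is longer than the tokenized prompt.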

Built on Llama3.1-8B-Instruct and Whisper-small backbone
Knowledge-distillation training approach
Multimodal processing capabilities
Translation quality (BLEU): 22.68 for en_de, 24.10 for es_en
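
The knowledge-distillation item above does not spell out the loss, so the following is only a generic sketch of one common form: the mean KL divergence from a teacher's token distribution to a student's. Which model plays teacher and student, the temperature, and the function names are all assumptions made for illustration.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable log-softmax over the last (vocabulary) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0) -> float:
    """Mean KL(teacher || student) over sequence positions."""
    t_logp = log_softmax(np.asarray(teacher_logits) / temperature)
    s_logp = log_softmax(np.asarray(student_logits) / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())


teacher = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
student = np.array([[1.0, 1.0, -1.0], [0.3, 0.2, 0.1]])
print(distillation_loss(student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, which is what lets an adapter be trained without labelled target text.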

Core Capabilities

Speech and text input processing
Voice agent functionality
Speech-to-speech translation
Spoken audio analysis
Low latency response generation
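
To make the latency item concrete, the time-to-first-token and tokens-per-second figures quoted under Implementation Details can be combined into a rough response-time estimate. This is a simplified model, assuming a constant decoding rate after the first token; the function name and defaults are illustrative, and real latency also depends on prompt length, audio length, and serving setup.

```python
def estimate_response_seconds(n_tokens: int, ttft_s: float = 0.2,
                              tokens_per_s: float = 50.0) -> float:
    """Rough time to produce n_tokens: first-token latency plus steady decoding."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    # The first token arrives after ttft_s; the rest stream at tokens_per_s.
    return ttft_s + (n_tokens - 1) / tokens_per_s


# A 51-token reply at both ends of the quoted 50-100 tokens/s range.
slow = estimate_response_seconds(51, tokens_per_s=50.0)
fast = estimate_response_seconds(51, tokens_per_s=100.0)
print(f"{slow:.2f}s to {fast:.2f}s")
```

With these inputs a 51-token reply takes roughly 0.7 to 1.2 seconds in total, while the first token is available after about 0.2 seconds, which is the figure that matters most for a streaming voice agent.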

Frequently Asked Questions

Q: What makes this model unique?

It processes both speech and text through a single architecture and trains only an adapter on top of frozen pretrained components. At 8.06B parameters, it reports a 200ms time-to-first-token and 50-100 tokens per second on an A100-40GB GPU.

Q: What are the recommended use cases?

Ultravox v0.3 is suited to voice agents, speech-to-speech translation, spoken audio analysis, and other scenarios that need both speech and text processing. Its low time-to-first-token makes it a good fit for interactive voice applications that need quick responses.