Ultravox v0.4.1 Llama 3.1 8B
Property | Value |
---|---|
Parameter Count | 50.3M |
License | MIT |
Supported Languages | 15 languages including English, Arabic, German, etc. |
Training Hardware | 8x H100 GPUs |
Format | BF16 |
What is ultravox-v0_4_1-llama-3_1-8b?
Ultravox is a multimodal speech LLM developed by Fixie.ai that combines Llama 3.1 8B Instruct with the whisper-large-v3-turbo audio encoder to process both speech and text inputs. It accepts a text system prompt alongside voice user messages, making it well suited to conversational voice applications.
Implementation Details
The architecture uses a frozen Llama 3.1 8B backbone and a frozen Whisper encoder; only the multimodal adapter between them is trained. Audio input is marked in the text with a special <|audio|> pseudo-token, which is replaced at inference time with embeddings derived from the audio. On an A100-40GB GPU, the model reaches a time-to-first-token of roughly 150 ms and generates 50-100 tokens per second.
- Knowledge-distillation training approach
- BF16 mixed precision training
- Integration with 7 major speech datasets
- Support for 15 different languages
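The <|audio|> pseudo-token replacement described above can be illustrated with a toy sketch. This is not Ultravox's actual code: the embedding table, dimensions, and frame counts below are made up for illustration; only the splicing idea (text embeddings with audio-derived embeddings inserted at the placeholder position) reflects the mechanism.

```python
import numpy as np

AUDIO_TOKEN = "<|audio|>"
EMBED_DIM = 4  # tiny for illustration; the real model uses Llama's hidden size

def embed_text(tokens):
    """Stand-in for Llama's embedding table: deterministic random vectors."""
    rng = np.random.default_rng(0)
    table = {t: rng.standard_normal(EMBED_DIM) for t in sorted(set(tokens))}
    return [table[t] for t in tokens]

def splice_audio(tokens, audio_frames):
    """Replace the <|audio|> placeholder with adapter-output audio embeddings."""
    idx = tokens.index(AUDIO_TOKEN)
    text_embeds = embed_text([t for t in tokens if t != AUDIO_TOKEN])
    # The LLM then attends over text and audio embeddings as one sequence.
    return text_embeds[:idx] + list(audio_frames) + text_embeds[idx:]

tokens = ["Transcribe", ":", AUDIO_TOKEN, "."]
audio_frames = np.zeros((3, EMBED_DIM))  # 3 fake frames from the adapter
seq = splice_audio(tokens, audio_frames)
assert len(seq) == 3 + 3  # 3 text tokens + 3 audio frames
```

Because only the adapter producing those audio frames is trained, the text embedding table and the LLM weights stay untouched during fine-tuning.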
Core Capabilities
- Speech-to-text translation across multiple languages
- Voice agent functionality
- Spoken audio analysis
- Multimodal processing of both text and speech inputs
- High-performance translation capabilities with BLEU scores ranging from 12.28 to 39.65 across different language pairs
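For readers unfamiliar with the BLEU figures quoted above, a minimal sentence-level BLEU-4 can be sketched as modified n-gram precision combined with a brevity penalty. This is a simplified, unsmoothed version for illustration; real evaluations use corpus-level tooling such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU-4 (0-100), no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram matches: each reference n-gram counts at most once per occurrence.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * geo_mean
```

A perfect match scores 100; a score in the 12-40 range, as reported here, reflects partial n-gram overlap with the reference translation and varies widely by language pair.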
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to process both speech and text inputs seamlessly, combined with its support for 15 languages and efficient performance metrics, makes it particularly valuable for multilingual speech applications.
Q: What are the recommended use cases?
The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any scenario requiring multilingual speech understanding and processing.
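A minimal usage sketch via the Hugging Face `transformers` pipeline, following the pattern published on the model card. The audio path and system prompt are placeholders, and the call itself is untested here; the heavy imports are kept inside the function so the sketch can be read without downloading the 8B checkpoint.

```python
def build_turns(system_prompt: str) -> list:
    """Build the chat `turns` list the Ultravox pipeline expects."""
    return [{"role": "system", "content": system_prompt}]

def run_ultravox(audio_path: str, system_prompt: str, max_new_tokens: int = 64):
    import librosa       # pip install librosa
    import transformers  # pip install transformers peft

    pipe = transformers.pipeline(
        model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
        trust_remote_code=True,
    )
    audio, sr = librosa.load(audio_path, sr=16000)  # Whisper expects 16 kHz audio
    return pipe(
        {"audio": audio, "turns": build_turns(system_prompt), "sampling_rate": sr},
        max_new_tokens=max_new_tokens,
    )
```

Resampling to 16 kHz matches the input rate the Whisper encoder was trained on; the user's spoken audio takes the place of a text user message in the chat turns.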