Ultravox v0.4.1 Llama 3.1 8B
Property | Value |
---|---|
Parameter Count | 50.3M |
License | MIT |
Supported Languages | 15 languages including English, Arabic, German, etc. |
Training Hardware | 8x H100 GPUs |
Format | BF16 |
What is ultravox-v0_4_1-llama-3_1-8b?
Ultravox is a multimodal speech LLM developed by Fixie.ai that combines Llama 3.1 8B Instruct with the whisper-large-v3-turbo audio encoder to process both speech and text inputs. It accepts a text system prompt alongside voice user messages, making it well suited to conversational voice applications.
Implementation Details
The architecture uses a frozen Llama 3.1 8B backbone and a frozen Whisper encoder; only the multimodal adapter between them is trained. Audio input is marked in the text with a special <|audio|> pseudo-token, which is replaced at inference time with embeddings derived from the audio. On an A100-40GB GPU, the model reaches a time-to-first-token of roughly 150 ms and generates 50-100 tokens per second.
- Knowledge-distillation training approach
- BF16 mixed precision training
- Integration with 7 major speech datasets
- Support for 15 different languages
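The <|audio|> pseudo-token replacement described above can be illustrated with a toy sketch. This is not Ultravox's actual code: the embedding table, dimensions, and frame counts below are made up for illustration; only the splicing idea (text embeddings with audio-derived embeddings inserted at the placeholder position) reflects the mechanism.

```python
import numpy as np

AUDIO_TOKEN = "<|audio|>"
EMBED_DIM = 4  # tiny for illustration; the real model uses Llama's hidden size

def embed_text(tokens):
    """Stand-in for Llama's embedding table: deterministic random vectors."""
    rng = np.random.default_rng(0)
    table = {t: rng.standard_normal(EMBED_DIM) for t in sorted(set(tokens))}
    return [table[t] for t in tokens]

def splice_audio(tokens, audio_frames):
    """Replace the <|audio|> placeholder with adapter-output audio embeddings."""
    idx = tokens.index(AUDIO_TOKEN)
    text_embeds = embed_text([t for t in tokens if t != AUDIO_TOKEN])
    # The LLM then attends over text and audio embeddings as one sequence.
    return text_embeds[:idx] + list(audio_frames) + text_embeds[idx:]

tokens = ["Transcribe", ":", AUDIO_TOKEN, "."]
audio_frames = np.zeros((3, EMBED_DIM))  # 3 fake frames from the adapter
seq = splice_audio(tokens, audio_frames)
assert len(seq) == 3 + 3  # 3 text tokens + 3 audio frames
```

Because only the adapter producing those audio frames is trained, the text embedding table and the LLM weights stay untouched during fine-tuning.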
Core Capabilities
- Speech-to-text translation across multiple languages
- Voice agent functionality
- Spoken audio analysis
- Multimodal processing of both text and speech inputs
- High-performance translation capabilities with BLEU scores ranging from 12.28 to 39.65 across different language pairs
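For readers unfamiliar with the BLEU figures quoted above, a minimal sentence-level BLEU-4 can be sketched as modified n-gram precision combined with a brevity penalty. This is a simplified, unsmoothed version for illustration; real evaluations use corpus-level tooling such as sacrebleu.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Minimal sentence-level BLEU-4 (0-100), no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped n-gram matches: each reference n-gram counts at most once per occurrence.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return 100 * bp * geo_mean
```

A perfect match scores 100; a score in the 12-40 range, as reported here, reflects partial n-gram overlap with the reference translation and varies widely by language pair.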
Frequently Asked Questions
Q: What makes this model unique?
The model's ability to process both speech and text inputs seamlessly, combined with its support for 15 languages and efficient performance metrics, makes it particularly valuable for multilingual speech applications.
Q: What are the recommended use cases?
The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any scenario requiring multilingual speech understanding and processing.
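A minimal usage sketch via the Hugging Face `transformers` pipeline, following the pattern published on the model card. The audio path and system prompt are placeholders, and the call itself is untested here; the heavy imports are kept inside the function so the sketch can be read without downloading the 8B checkpoint.

```python
def build_turns(system_prompt: str) -> list:
    """Build the chat `turns` list the Ultravox pipeline expects."""
    return [{"role": "system", "content": system_prompt}]

def run_ultravox(audio_path: str, system_prompt: str, max_new_tokens: int = 64):
    import librosa       # pip install librosa
    import transformers  # pip install transformers peft

    pipe = transformers.pipeline(
        model="fixie-ai/ultravox-v0_4_1-llama-3_1-8b",
        trust_remote_code=True,
    )
    audio, sr = librosa.load(audio_path, sr=16000)  # Whisper expects 16 kHz audio
    return pipe(
        {"audio": audio, "turns": build_turns(system_prompt), "sampling_rate": sr},
        max_new_tokens=max_new_tokens,
    )
```

Resampling to 16 kHz matches the input rate the Whisper encoder was trained on; the user's spoken audio takes the place of a text user message in the chat turns.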