ultravox-v0_4_1-llama-3_1-8b

Maintained By
fixie-ai

Ultravox v0.4.1 Llama 3.1 8B

PropertyValue
Parameter Count50.3M
LicenseMIT
Supported Languages15 languages including English, Arabic, German, etc.
Training Hardware8x H100 GPUs
FormatBF16

What is ultravox-v0_4_1-llama-3_1-8b?

Ultravox is an advanced multimodal Speech LLM that combines the power of Llama 3.1-8B-Instruct and whisper-large-v3-turbo to process both speech and text inputs. Developed by Fixie.ai, it represents a significant advancement in multimodal AI processing, capable of handling both text system prompts and voice user messages.

Implementation Details

The model architecture utilizes a frozen Llama 3.1 8B backbone and Whisper encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. The model achieves impressive performance metrics, with a time-to-first-token of approximately 150ms and generates 50-100 tokens per second on an A100-40GB GPU.

  • Knowledge-distillation training approach
  • BF16 mixed precision training
  • Integration with 7 major speech datasets
  • Support for 15 different languages

Core Capabilities

  • Speech-to-text translation across multiple languages
  • Voice agent functionality
  • Spoken audio analysis
  • Multimodal processing of both text and speech inputs
  • High-performance translation capabilities with BLEU scores ranging from 12.28 to 39.65 across different language pairs

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs seamlessly, combined with its support for 15 languages and efficient performance metrics, makes it particularly valuable for multilingual speech applications.

Q: What are the recommended use cases?

The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any scenario requiring multilingual speech understanding and processing.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.