ultravox-v0_3

fixie-ai

Ultravox v0.3 is an 8.06B-parameter multimodal speech LLM that combines Llama3.1-8B-Instruct with Whisper-small to process speech and text. It is released under the MIT license.

Property         Value
Parameter Count  8.06B
Model Type       Multimodal Speech LLM
License          MIT
Tensor Type      BF16
Repository       https://ultravox.ai

What is ultravox-v0_3?

Ultravox v0.3 is a multimodal speech language model built from Llama3.1-8B-Instruct and Whisper-small. It accepts both speech and text as input, which makes it suitable for voice-based applications as well as general natural language processing tasks.

Implementation Details

The model uses a frozen Llama3.1-8B-Instruct backbone and a frozen Whisper-small encoder; only the multimodal adapter is trained. The text prompt contains a special <|audio|> pseudo-token, which is replaced with embeddings derived from the input audio. Training used BF16 mixed precision on 8x H100 GPUs. At inference, the reported figures are a 200ms time-to-first-token and a generation rate of 50-100 tokens per second on an A100-40GB GPU.
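
The pseudo-token substitution described above can be illustrated with a small sketch. This is not Ultravox's actual code: the function name, the placeholder id, and the toy 2-dimensional embeddings are all hypothetical, and it assumes exactly one <|audio|> placeholder per prompt. It only shows the idea that one placeholder position is swapped for a run of audio-derived vectors before the sequence reaches the language model.

```python
from typing import List

Vector = List[float]

# Hypothetical id standing in for the <|audio|> pseudo-token (an assumption).
AUDIO_TOKEN_ID = -1


def splice_audio_embeddings(
    token_ids: List[int],
    text_embeddings: List[Vector],
    audio_embeddings: List[Vector],
    audio_token_id: int = AUDIO_TOKEN_ID,
) -> List[Vector]:
    """Replace the single placeholder position with the audio-derived vectors."""
    if len(token_ids) != len(text_embeddings):
        raise ValueError("need one embedding per token id")
    if token_ids.count(audio_token_id) != 1:
        raise ValueError("expected exactly one audio placeholder")
    pos = token_ids.index(audio_token_id)
    # Text before the placeholder, then every audio frame, then the remaining text.
    return text_embeddings[:pos] + audio_embeddings + text_embeddings[pos + 1:]


# Toy example: three prompt positions, the middle one being the placeholder.
token_ids = [11, AUDIO_TOKEN_ID, 12]
text_embeddings = [[0.1, 0.1], [0.0, 0.0], [0.2, 0.2]]
audio_embeddings = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

merged = splice_audio_embeddings(token_ids, text_embeddings, audio_embeddings)
print(len(merged))  # 5: two text positions plus three audio frames
```

Note that the merged sequence is longer than the original prompt, since one placeholder expands into as many positions as there are audio frames.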

  • Built on Llama3.1-8B-Instruct and Whisper-small backbone
  • Knowledge-distillation training approach
  • Multimodal processing capabilities
  • Reported BLEU scores: 22.68 for en_de, 24.10 for es_en
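
The page does not say how the knowledge-distillation objective is defined. As a rough illustration only, a common form of distillation minimizes the KL divergence between a teacher's and a student's next-token distributions. The sketch below assumes that form; the function names, the temperature parameter, and the toy logits are hypothetical and are not taken from the Ultravox training code.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) for a single token position (assumed loss form)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a 3-word vocabulary.
teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
diverging_student = [-1.0, 0.5, 2.0]

print(kl_distillation_loss(teacher, matching_student))
print(kl_distillation_loss(teacher, diverging_student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what pushes the student's outputs toward the teacher's.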

Core Capabilities

  • Speech and text input processing
  • Voice agent functionality
  • Speech-to-speech translation
  • Spoken audio analysis
  • Low latency response generation

Frequently Asked Questions

Q: What makes this model unique?

The model handles both speech and text inputs in a single architecture and trains only the multimodal adapter on top of frozen Llama3.1-8B-Instruct and Whisper-small components. At 8.06B parameters, it reports a 200ms time-to-first-token and 50-100 tokens per second on an A100-40GB GPU.

Q: What are the recommended use cases?

Ultravox v0.3 is suited to voice agents, speech-to-speech translation, and spoken audio analysis, and more generally to applications that need both speech and text processing. Its low time-to-first-token makes it a good fit for interactive voice applications that require quick responses.
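
To judge whether the reported latency figures fit an interactive use case, a back-of-the-envelope estimate can help. The sketch below assumes a simple model in which a reply takes the time-to-first-token plus the number of generated tokens divided by the generation rate. The function name and the 60-token reply length are hypothetical; the 200ms and 50-100 tokens per second figures are the ones reported above for an A100-40GB GPU.

```python
def estimate_response_seconds(
    num_tokens: int,
    ttft_seconds: float,
    tokens_per_second: float,
) -> float:
    """Estimate total generation time: time-to-first-token plus streaming time."""
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_seconds + num_tokens / tokens_per_second


REPLY_TOKENS = 60  # hypothetical length of a short spoken-style reply

slow = estimate_response_seconds(REPLY_TOKENS, ttft_seconds=0.2, tokens_per_second=50)
fast = estimate_response_seconds(REPLY_TOKENS, ttft_seconds=0.2, tokens_per_second=100)
print(f"{fast:.1f}s to {slow:.1f}s")  # 0.8s to 1.4s
```

Under these assumptions, a 60-token reply would finish generating in roughly 0.8 to 1.4 seconds, with the first token available after about 0.2 seconds. The estimate covers text generation only and ignores audio capture, network, and any speech synthesis time.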
