ultravox-v0_4

fixie-ai

Multimodal Speech LLM combining Llama3.1-8B-Instruct and Whisper-medium for speech/text processing. Achieves 4.45% WER on LibriSpeech with ~50-100 tokens/sec generation.

Property      Value
Developer     Fixie.ai
License       MIT
Base Models   Llama3.1-8B-Instruct, Whisper-medium
Repository    https://ultravox.ai

What is ultravox-v0_4?

Ultravox-v0_4 is a multimodal speech language model that pairs Llama3.1-8B-Instruct with a Whisper-medium audio encoder to process both speech and text inputs. The model understands spoken language and generates text responses, making it suitable for voice-based applications and speech-analysis tasks.

Implementation Details

The model utilizes a frozen Llama3.1-8B-Instruct backbone and Whisper-medium encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. Training was conducted using BF16 mixed precision on 8x H100 GPUs, employing a knowledge-distillation loss to match the text-based Llama backbone's logits.
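The pseudo-token mechanism described above can be sketched as follows. This is a toy illustration, not Ultravox's actual code: the token ID, embedding width, and array shapes are invented for the example. In the real model, the trained adapter projects Whisper-medium encoder states into Llama's embedding space, and those projected vectors take the place of the single <|audio|> slot in the token sequence.

```python
import numpy as np

HIDDEN = 8          # toy embedding width (Llama3.1-8B actually uses 4096)
AUDIO_TOKEN = -1    # stand-in ID for the <|audio|> pseudo-token

def splice_audio(token_ids, text_embeds, audio_embeds):
    """Replace the single <|audio|> slot in the text embedding
    sequence with the adapter's audio-derived embeddings."""
    pos = token_ids.index(AUDIO_TOKEN)
    return np.concatenate(
        [text_embeds[:pos], audio_embeds, text_embeds[pos + 1:]], axis=0
    )

# Toy inputs: 5 text tokens, one of which is the audio slot, plus
# 3 audio frames already projected into the LLM's embedding space.
token_ids = [10, 11, AUDIO_TOKEN, 12, 13]
text_embeds = np.random.randn(5, HIDDEN)
audio_embeds = np.random.randn(3, HIDDEN)

seq = splice_audio(token_ids, text_embeds, audio_embeds)
print(seq.shape)  # (7, 8): 4 text embeddings + 3 audio embeddings
```

Because the Llama backbone and Whisper encoder stay frozen, only the small projection that produces `audio_embeds` needs gradient updates during training.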

  • Time-to-first-token (TTFT): ~150ms
  • Generation speed: 50-100 tokens/second on A100-40GB GPU
  • Improved WER: 4.45% on LibriSpeech clean test
  • Enhanced translation performance: BLEU scores of 25.47 (en_de) and 37.11 (es_en)

Core Capabilities

  • Speech-to-text processing with high accuracy
  • Multilingual translation support
  • Voice agent functionality
  • Speech analysis and understanding
  • Text and speech input processing

Frequently Asked Questions

Q: What makes this model unique?

Ultravox-v0_4 stands out for its ability to process both speech and text inputs in a unified framework, with strong speech-recognition and translation results (4.45% WER on LibriSpeech clean; BLEU 25.47 en_de and 37.11 es_en). Its architecture, combining a frozen Llama3.1 backbone with a Whisper-medium encoder, enables efficient processing with relatively low latency.

Q: What are the recommended use cases?

The model is ideal for applications requiring voice agent capabilities, speech-to-speech translation, spoken audio analysis, and general conversational AI tasks. It can be easily integrated into applications using standard Python libraries like transformers, peft, and librosa.
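A minimal integration sketch using the transformers pipeline API is below. It assumes the model is published on the Hugging Face Hub under an ID like `fixie-ai/ultravox-v0_4` and ships custom pipeline code (hence `trust_remote_code=True`); the input dict keys (`audio`, `turns`, `sampling_rate`) follow that custom pipeline's conventions and should be checked against the model card before use.

```python
def ask_ultravox(audio_path, system_prompt="You are a helpful assistant.",
                 model_id="fixie-ai/ultravox-v0_4"):
    """Sketch: run one spoken query through the Ultravox pipeline.
    Imports are deferred so the function can be defined without the
    heavy dependencies installed."""
    import librosa        # third-party: audio loading and resampling
    import transformers   # Hugging Face transformers

    # Custom pipeline code lives in the model repo, hence trust_remote_code.
    pipe = transformers.pipeline(model=model_id, trust_remote_code=True)

    # Whisper-family encoders expect 16 kHz mono audio.
    audio, sr = librosa.load(audio_path, sr=16000)

    # Conversation turns; the pipeline handles <|audio|> placement.
    turns = [{"role": "system", "content": system_prompt}]

    return pipe(
        {"audio": audio, "turns": turns, "sampling_rate": sr},
        max_new_tokens=64,
    )
```

Note that loading the model pulls an 8B-parameter checkpoint, so a GPU (the figures above were measured on an A100-40GB) is recommended for interactive latency.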
