Maintained By
fixie-ai

Ultravox-v0_4

  • Developer: Fixie.ai
  • License: MIT
  • Base Models: Llama3.1-8B-Instruct, Whisper-medium
  • Repository: https://ultravox.ai

What is ultravox-v0_4?

Ultravox-v0_4 is a multimodal Speech Language Model that combines Llama3.1-8B-Instruct with a Whisper-medium audio encoder to process both speech and text inputs. The model understands spoken language and generates text responses, making it well suited to voice-based applications and speech analysis tasks.

Implementation Details

The model utilizes a frozen Llama3.1-8B-Instruct backbone and Whisper-medium encoder, with only the multi-modal adapter being trained. It processes input through a special <|audio|> pseudo-token that gets replaced with audio-derived embeddings. Training was conducted using BF16 mixed precision on 8x H100 GPUs, employing a knowledge-distillation loss to match the text-based Llama backbone's logits.

  • Time-to-first-token (TTFT): ~150ms
  • Generation speed: 50-100 tokens/second on A100-40GB GPU
  • Improved word error rate (WER): 4.45% on LibriSpeech test-clean
  • Enhanced translation performance: BLEU scores of 25.47 (en_de) and 37.11 (es_en)
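To make the adapter-and-placeholder mechanism described above more concrete, here is a minimal conceptual sketch rather than the model's actual code: the MultiModalAdapter class, the toy dimensions, and the splice_audio_embeddings helper are illustrative assumptions, showing only how projected audio embeddings could take the place of the <|audio|> placeholder before the sequence reaches the frozen Llama backbone.

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only; Whisper-medium encoder states are 1024-dim
# and the Llama3.1-8B embedding space is 4096-dim, but the real adapter may differ.
AUDIO_DIM, TEXT_DIM = 1024, 4096

class MultiModalAdapter(nn.Module):
    """Hypothetical adapter that projects audio encoder states into the LLM's embedding space."""
    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_states)

def splice_audio_embeddings(text_embeds, audio_embeds, audio_token_pos):
    """Replace the single <|audio|> placeholder embedding with the
    variable-length sequence of projected audio embeddings."""
    before = text_embeds[:audio_token_pos]
    after = text_embeds[audio_token_pos + 1:]
    return torch.cat([before, audio_embeds, after], dim=0)

# Dummy tensors standing in for real encoder and tokenizer outputs.
adapter = MultiModalAdapter(AUDIO_DIM, TEXT_DIM)
audio_states = torch.randn(120, AUDIO_DIM)   # Whisper encoder output frames
text_embeds = torch.randn(10, TEXT_DIM)      # prompt embeddings, <|audio|> at index 4
fused = splice_audio_embeddings(text_embeds, adapter(audio_states), audio_token_pos=4)
print(fused.shape)  # torch.Size([129, 4096]) -> fed to the frozen Llama backbone
```

In the released model, only the adapter's weights are updated during training; the Whisper encoder and the Llama backbone remain frozen, which keeps the trainable parameter count small.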

Core Capabilities

  • Speech-to-text processing with high accuracy
  • Multilingual translation support
  • Voice agent functionality
  • Speech analysis and understanding
  • Text and speech input processing

Frequently Asked Questions

Q: What makes this model unique?

Ultravox-v0_4 stands out for processing both speech and text inputs in a unified framework while delivering strong speech recognition and translation performance. Its architecture, which pairs a frozen Llama3.1-8B-Instruct backbone with a Whisper-medium encoder, enables efficient inference with relatively low latency.

Q: What are the recommended use cases?

The model is ideal for applications requiring voice agent capabilities, speech-to-speech translation, spoken audio analysis, and general conversational AI tasks. It can be easily integrated into applications using standard Python libraries like transformers, peft, and librosa.
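As a starting point for integration, here is a usage sketch built on the Hugging Face transformers pipeline with remote code; the model ID, the input file name, and the input keys ('turns', 'audio', 'sampling_rate') follow the pattern of the published usage example but are assumptions that may differ between releases.

```python
import transformers
import librosa

# Load the custom Ultravox pipeline; trust_remote_code=True pulls the
# repository's own preprocessing and generation code.
pipe = transformers.pipeline(model="fixie-ai/ultravox-v0_4", trust_remote_code=True)

# Hypothetical input file; the model expects 16 kHz audio.
audio, sr = librosa.load("question.wav", sr=16000)

turns = [
    {"role": "system", "content": "You are a helpful voice assistant."},
]

# The audio is passed alongside the chat turns; the custom pipeline inserts
# the <|audio|> placeholder and splices in the audio embeddings internally.
result = pipe({"turns": turns, "audio": audio, "sampling_rate": sr}, max_new_tokens=64)
print(result)
```

Because the pipeline is loaded with trust_remote_code=True, the model repository's own code handles audio preprocessing and prompt construction, so the caller only supplies raw audio and chat turns.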
