ultravox-v0_4_1-mistral-nemo

fixie-ai

Multimodal Speech LLM combining Mistral-Nemo and Whisper for speech/text processing. 52.4M params, supports 15 languages, MIT license.

  • Parameter Count: 52.4M
  • License: MIT
  • Tensor Type: BF16
  • Languages Supported: 15
  • Repository: ultravox.ai

What is ultravox-v0_4_1-mistral-nemo?

Ultravox is a multimodal Speech LLM built on the Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. It accepts both speech and text inputs, making it versatile across applications: a special <|audio|> pseudo-token in the text prompt marks where audio embeddings are spliced into the input sequence, integrating audio with text processing.
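The pseudo-token mechanism can be illustrated with a toy sketch. This is not the actual Ultravox implementation; the function name, the 1-d "embeddings", and the adapter producing `audio_embeds` are all illustrative assumptions. It only shows the idea: the <|audio|> placeholder is replaced by a run of projected audio embeddings before the combined sequence reaches the LLM.

```python
# Illustrative sketch (NOT the real Ultravox code): splice projected
# audio embeddings into the text embedding sequence at the position
# of the <|audio|> pseudo-token.

AUDIO_TOKEN = "<|audio|>"

def splice_audio(text_tokens, text_embeds, audio_embeds):
    """Replace the <|audio|> placeholder with audio embeddings.

    text_tokens  : list of token strings, one of them AUDIO_TOKEN
    text_embeds  : embedding vectors, parallel to text_tokens
    audio_embeds : vectors from a (hypothetical) adapter that projects
                   Whisper features into the LLM's embedding space
    """
    out = []
    for tok, emb in zip(text_tokens, text_embeds):
        if tok == AUDIO_TOKEN:
            out.extend(audio_embeds)  # the audio run takes the token's place
        else:
            out.append(emb)
    return out

tokens = ["<s>", "Transcribe:", AUDIO_TOKEN, "</s>"]
embeds = [[0.0], [0.1], [None], [0.2]]   # toy 1-d "embeddings"
audio  = [[9.0], [9.1], [9.2]]           # 3 frames of audio features

combined = splice_audio(tokens, embeds, audio)
print(len(combined))  # 3 text embeddings + 3 audio frames = 6
```

Note the sequence can grow: one pseudo-token expands into as many positions as the audio clip has frames, which is why the LLM itself never needs a dedicated audio vocabulary.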

Implementation Details

The architecture trains only the multimodal adapter; the Whisper encoder and the Mistral backbone are kept frozen. Training uses a knowledge-distillation loss that pushes the speech model's logits to match those the text-only Mistral backbone produces on the same content, and was run in BF16 mixed precision on 8x H100 GPUs.
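The distillation objective described above can be sketched as a KL divergence between the teacher (text-backbone) and student (speech-input) token distributions. This is a minimal plain-Python illustration of the loss shape, not the repository's training code; the function names are assumptions.

```python
# Sketch of a knowledge-distillation loss: match the speech model's
# next-token distribution to the frozen text backbone's distribution.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student) at one token position."""
    p = softmax(teacher_logits)  # teacher: text-only Mistral backbone
    q = softmax(student_logits)  # student: same backbone fed audio embeddings
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 1.0, 0.1]
student = [1.8, 1.1, 0.2]
loss = kd_loss(student, teacher)  # small positive value; 0 when logits match
```

Because the teacher is the same frozen Mistral backbone run on text, the adapter learns to produce audio embeddings that "look like" the transcript's embeddings from the LLM's point of view.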

  • Time-to-first-token (TTFT): ~150ms
  • Processing speed: 50-100 tokens/second on A100-40GB GPU
  • Trained on 7 diverse datasets including LibriSpeech ASR and Common Voice
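The TTFT and decode-speed figures above combine into a rough end-to-end latency estimate. A back-of-envelope sketch, using the quoted ~150 ms TTFT and the conservative end of the 50-100 tokens/s range:

```python
# Rough response-latency estimate from the figures quoted above:
# ~150 ms time-to-first-token, then 50-100 tokens/s decode (A100-40GB).
def response_latency_s(n_tokens, ttft_s=0.150, tokens_per_s=50):
    """Seconds until an n_tokens reply finishes generating."""
    return ttft_s + n_tokens / tokens_per_s

# A 100-token reply at the conservative 50 tokens/s:
print(round(response_latency_s(100), 2))  # 2.15 seconds
```

At the optimistic 100 tokens/s the same reply takes about 1.15 s, which is why short responses are the sweet spot for voice-agent use.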

Core Capabilities

  • Multimodal processing of speech and text inputs
  • Support for 15 languages including English, Chinese, Arabic, and more
  • Speech-to-speech translation capabilities
  • Voice agent functionality
  • Spoken audio analysis

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs through a unified architecture, combined with its support for 15 languages and fast processing speed, makes it stand out. The special <|audio|> pseudo-token system allows for seamless integration of audio and text processing.

Q: What are the recommended use cases?

The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any application requiring multimodal processing of speech and text. It's particularly useful in multilingual environments due to its extensive language support.
