ultravox-v0_4_1-mistral-nemo

fixie-ai

Multimodal Speech LLM combining Mistral-Nemo and Whisper for speech/text processing. 52.4M params, supports 15 languages, MIT license.

  • Parameter Count: 52.4M
  • License: MIT
  • Tensor Type: BF16
  • Languages Supported: 15
  • Repository: ultravox.ai

What is ultravox-v0_4_1-mistral-nemo?

Ultravox is a multimodal Speech LLM built on the Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. It accepts both speech and text inputs, making it versatile across applications: a special <|audio|> pseudo-token in the text prompt marks where audio embeddings are spliced into the input sequence, integrating audio with text processing.
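The pseudo-token mechanism can be illustrated with a toy sketch. This is not the actual Ultravox implementation; the function name, the 1-d "embeddings", and the adapter producing `audio_embeds` are all illustrative assumptions. It only shows the idea: the <|audio|> placeholder is replaced by a run of projected audio embeddings before the combined sequence reaches the LLM.

```python
# Illustrative sketch (NOT the real Ultravox code): splice projected
# audio embeddings into the text embedding sequence at the position
# of the <|audio|> pseudo-token.

AUDIO_TOKEN = "<|audio|>"

def splice_audio(text_tokens, text_embeds, audio_embeds):
    """Replace the <|audio|> placeholder with audio embeddings.

    text_tokens  : list of token strings, one of them AUDIO_TOKEN
    text_embeds  : embedding vectors, parallel to text_tokens
    audio_embeds : vectors from a (hypothetical) adapter that projects
                   Whisper features into the LLM's embedding space
    """
    out = []
    for tok, emb in zip(text_tokens, text_embeds):
        if tok == AUDIO_TOKEN:
            out.extend(audio_embeds)  # the audio run takes the token's place
        else:
            out.append(emb)
    return out

tokens = ["<s>", "Transcribe:", AUDIO_TOKEN, "</s>"]
embeds = [[0.0], [0.1], [None], [0.2]]   # toy 1-d "embeddings"
audio  = [[9.0], [9.1], [9.2]]           # 3 frames of audio features

combined = splice_audio(tokens, embeds, audio)
print(len(combined))  # 3 text embeddings + 3 audio frames = 6
```

Note the sequence can grow: one pseudo-token expands into as many positions as the audio clip has frames, which is why the LLM itself never needs a dedicated audio vocabulary.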

Implementation Details

The architecture trains only the multimodal adapter; the Whisper encoder and the Mistral backbone are kept frozen. Training uses a knowledge-distillation loss that pushes the speech model's logits to match those the text-only Mistral backbone produces on the same content, and was run in BF16 mixed precision on 8x H100 GPUs.
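The distillation objective described above can be sketched as a KL divergence between the teacher (text-backbone) and student (speech-input) token distributions. This is a minimal plain-Python illustration of the loss shape, not the repository's training code; the function names are assumptions.

```python
# Sketch of a knowledge-distillation loss: match the speech model's
# next-token distribution to the frozen text backbone's distribution.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits):
    """KL(teacher || student) at one token position."""
    p = softmax(teacher_logits)  # teacher: text-only Mistral backbone
    q = softmax(student_logits)  # student: same backbone fed audio embeddings
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 1.0, 0.1]
student = [1.8, 1.1, 0.2]
loss = kd_loss(student, teacher)  # small positive value; 0 when logits match
```

Because the teacher is the same frozen Mistral backbone run on text, the adapter learns to produce audio embeddings that "look like" the transcript's embeddings from the LLM's point of view.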

  • Time-to-first-token (TTFT): ~150ms
  • Processing speed: 50-100 tokens/second on A100-40GB GPU
  • Trained on 7 diverse datasets including LibriSpeech ASR and Common Voice
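The TTFT and decode-speed figures above combine into a rough end-to-end latency estimate. A back-of-envelope sketch, using the quoted ~150 ms TTFT and the conservative end of the 50-100 tokens/s range:

```python
# Rough response-latency estimate from the figures quoted above:
# ~150 ms time-to-first-token, then 50-100 tokens/s decode (A100-40GB).
def response_latency_s(n_tokens, ttft_s=0.150, tokens_per_s=50):
    """Seconds until an n_tokens reply finishes generating."""
    return ttft_s + n_tokens / tokens_per_s

# A 100-token reply at the conservative 50 tokens/s:
print(round(response_latency_s(100), 2))  # 2.15 seconds
```

At the optimistic 100 tokens/s the same reply takes about 1.15 s, which is why short responses are the sweet spot for voice-agent use.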

Core Capabilities

  • Multimodal processing of speech and text inputs
  • Support for 15 languages including English, Chinese, Arabic, and more
  • Speech-to-speech translation capabilities
  • Voice agent functionality
  • Spoken audio analysis

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs through a unified architecture, combined with its support for 15 languages and fast processing speed, makes it stand out. The special <|audio|> pseudo-token system allows for seamless integration of audio and text processing.

Q: What are the recommended use cases?

The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any application requiring multimodal processing of speech and text. It's particularly useful in multilingual environments due to its extensive language support.
