ultravox-v0_4_1-mistral-nemo

Maintained By
fixie-ai

Ultravox-v0_4_1-mistral-nemo

PropertyValue
Parameter Count52.4M
LicenseMIT
Tensor TypeBF16
Languages Supported15
Repositoryultravox.ai

What is ultravox-v0_4_1-mistral-nemo?

Ultravox is an advanced multimodal Speech LLM that combines the power of Mistral-Nemo-Instruct-2407 and whisper-large-v3-turbo backbones. This innovative model can process both speech and text inputs, making it particularly versatile for various applications. It operates using a special <|audio|> pseudo-token system that seamlessly integrates audio embeddings with text processing.

Implementation Details

The model utilizes a sophisticated architecture where only the multi-modal adapter is trained while keeping the Whisper encoder and Mistral components frozen. It employs knowledge-distillation loss for training, where it aims to match the logits of the text-based Mistral backbone. Training was conducted using BF16 mixed precision on 8x H100 GPUs.

  • Time-to-first-token (TTFT): ~150ms
  • Processing speed: 50-100 tokens/second on A100-40GB GPU
  • Trained on 7 diverse datasets including LibriSpeech ASR and Common Voice

Core Capabilities

  • Multimodal processing of speech and text inputs
  • Support for 15 languages including English, Chinese, Arabic, and more
  • Speech-to-speech translation capabilities
  • Voice agent functionality
  • Spoken audio analysis

Frequently Asked Questions

Q: What makes this model unique?

The model's ability to process both speech and text inputs through a unified architecture, combined with its support for 15 languages and fast processing speed, makes it stand out. The special <|audio|> pseudo-token system allows for seamless integration of audio and text processing.

Q: What are the recommended use cases?

The model is ideal for voice agent applications, speech-to-speech translation, spoken audio analysis, and any application requiring multimodal processing of speech and text. It's particularly useful in multilingual environments due to its extensive language support.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.