Megrez-3B-Omni

Maintained By
Infinigence

Property: Value
Total Parameters: 4B
License: Apache-2.0
Model Type: Multi-modal (Text, Image, Audio)
Context Length: 4K tokens
Languages: Chinese & English
Model URL: https://huggingface.co/Infinigence/Megrez-3B-Omni

What is Megrez-3B-Omni?

Megrez-3B-Omni is a multi-modal model developed by Infinigence AI that understands text, image, and audio input. It is built on Megrez-3B-Instruct and reports state-of-the-art results on multiple benchmarks at a relatively compact size of about 4B parameters.

Implementation Details

The architecture integrates three specialized modules: a Llama-2 architecture with grouped-query attention (GQA) for language processing (2.29B params), SigLip-SO400M for vision (0.42B params), and the Whisper-large-v3 encoder for audio processing (0.64B params). These components are connected through cross-attention and linear layers.
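The three backbone sizes quoted above do not add up to the 4B total on their own, so a quick tally helps show where the headline figure comes from. The sketch below only sums the numbers given in this card. The names `BACKBONE_PARAMS_B`, `backbone_total`, and `unaccounted` are hypothetical, and attributing the remainder to embeddings, the output layer, and the connector layers is an assumption, since the card does not itemize it. The "4B" figure is also rounded, so the remainder is approximate.

```python
# Backbone parameter counts quoted in this card, in billions.
BACKBONE_PARAMS_B = {
    "language (Llama-2 with GQA)": 2.29,
    "vision (SigLip-SO400M)": 0.42,
    "audio (Whisper-large-v3 encoder)": 0.64,
}

# "4B" from the property list; a rounded figure, so treat it as approximate.
REPORTED_TOTAL_B = 4.0


def backbone_total(params: dict[str, float]) -> float:
    """Sum the backbone parameter counts (billions), rounded to 2 decimals."""
    return round(sum(params.values()), 2)


def unaccounted(params: dict[str, float], reported_total: float) -> float:
    """Parameters (billions) not covered by the three backbones.

    Assumption: this remainder is embeddings, the output layer, and the
    connector layers. The card itself does not break it down.
    """
    return round(reported_total - backbone_total(params), 2)


if __name__ == "__main__":
    for name, size in BACKBONE_PARAMS_B.items():
        print(f"{name}: {size:.2f}B")
    print(f"backbones combined: {backbone_total(BACKBONE_PARAMS_B):.2f}B")
    print(f"remainder to reported total: "
          f"{unaccounted(BACKBONE_PARAMS_B, REPORTED_TOTAL_B):.2f}B")
```

The language backbone accounts for more than half of the total, which is consistent with the model being described as an extension of a text-only base.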

  • Image understanding surpasses larger models on OpenCompass, with an average score of 66.2
  • Maintains strong language processing with minimal degradation (< 2%) compared with the text-only version
  • Supports real-time audio transcription and multi-turn conversations
  • Optimized for edge deployment

Core Capabilities

  • Superior image understanding across 8 mainstream benchmarks
  • State-of-the-art performance in OCR and scene understanding tasks
  • Competitive performance on language benchmarks including C-EVAL and MMLU
  • Advanced audio processing with support for both Chinese and English
  • Fast inference on an H100: 6312 tokens/s for prefill and 1294 tokens/s for decode
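The prefill and decode throughput figures above can be turned into a rough end-to-end latency estimate. The sketch below is a back-of-the-envelope calculation, not a benchmark. It assumes both figures are constant, single-request rates on an H100, which the card does not state, and it ignores image and audio encoding time. The function name `estimate_latency_s` is hypothetical.

```python
# Throughput figures quoted in this card (H100), in tokens per second.
PREFILL_TOKENS_PER_S = 6312.0
DECODE_TOKENS_PER_S = 1294.0


def estimate_latency_s(
    prompt_tokens: int,
    new_tokens: int,
    prefill_tps: float = PREFILL_TOKENS_PER_S,
    decode_tps: float = DECODE_TOKENS_PER_S,
) -> float:
    """Rough latency: time to read the prompt plus time to generate output.

    Assumption: both rates are constant and apply to a single request.
    Real latency depends on batching, hardware, and modality encoders.
    """
    if prompt_tokens < 0 or new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / prefill_tps + new_tokens / decode_tps


if __name__ == "__main__":
    # 3156 prompt tokens take 0.5 s of prefill; 647 new tokens take 0.5 s of decode.
    print(f"{estimate_latency_s(3156, 647):.2f} s")
```

Under these assumptions decode is roughly five times slower per token than prefill, so for most requests the output length, not the prompt length, dominates the total time.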

Frequently Asked Questions

Q: What makes this model unique?

The model combines strong performance across all three modalities (text, image, audio) with a relatively small parameter count. It does so by connecting specialized language, vision, and audio modules within a single compact architecture.

Q: What are the recommended use cases?

The model is suited to multi-modal applications including image understanding, OCR, scene description, audio transcription, and general language tasks. It is a particularly good fit for edge deployment, where resources are limited but high performance is still required.
