Megrez-3B-Omni

Infinigence

Megrez-3B-Omni is a multi-modal language model supporting text, image, and audio understanding. With 4B parameters, it achieves state-of-the-art performance on OpenCompass (66.2 average score) while maintaining strong language capabilities.

  • Total Parameters: 4B
  • License: Apache-2.0
  • Model Type: Multi-modal (Text, Image, Audio)
  • Context Length: 4K tokens
  • Languages: Chinese & English
  • Model URL: https://huggingface.co/Infinigence/Megrez-3B-Omni

What is Megrez-3B-Omni?

Megrez-3B-Omni is a groundbreaking multi-modal model developed by Infinigence AI that combines text, image, and audio understanding capabilities. Built upon the Megrez-3B-Instruct foundation, it achieves state-of-the-art performance across multiple benchmarks while maintaining a relatively compact size of 4B parameters.

Implementation Details

The model architecture integrates three specialized modules: a Llama-2-based language model with grouped-query attention (GQA) for text processing (2.29B parameters), SigLip-SO400M for vision (0.42B parameters), and the Whisper-large-v3 encoder for audio (0.64B parameters). These components are connected through cross-attention and linear projection layers.
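As a quick sanity check on the module sizes listed above, the figures can be summed directly. The numbers come from the model card; attributing the remainder (up to the stated 4B total) to connector and embedding layers is an assumption, not something the card states.

```python
# Module sizes as listed in the model card, in billions of parameters.
modules = {
    "language (Llama-2-based, GQA)": 2.29,
    "vision (SigLip-SO400M)":        0.42,
    "audio (Whisper-large-v3 enc.)": 0.64,
}

# Sum of the three listed modules; the gap to the quoted 4B total
# presumably sits in connectors and embeddings (an assumption).
listed_total = sum(modules.values())
print(f"listed modules: {listed_total:.2f}B")  # 3.35B
```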

  • Image understanding surpasses larger models on OpenCompass, with a 66.2 average score
  • Maintains strong language performance, with less than 2% degradation compared to the text-only Megrez-3B-Instruct
  • Supports real-time audio transcription and multi-turn conversations
  • Optimized for edge deployment
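To illustrate how a mixed text/image/audio request might be assembled for a model like this, the sketch below builds a single chat turn. The `{"role", "content"}` schema mirrors common Hugging Face chat templates; the exact keys and method names Megrez-3B-Omni expects are an assumption here, not taken from the model card.

```python
def build_turn(text, image_path=None, audio_path=None):
    """Assemble one multi-modal user turn (text plus optional image/audio).

    This follows the generic chat-template message shape used by many
    Hugging Face models; treat it as a sketch, not the model's real API.
    """
    content = []
    if image_path is not None:
        content.append({"type": "image", "path": image_path})
    if audio_path is not None:
        content.append({"type": "audio", "path": audio_path})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}

# A turn pairing a question with an image (file name is illustrative).
turn = build_turn("What is shown in this picture?", image_path="cat.jpg")
print(turn)
```

The resulting list of turns would then be handed to the model's chat/generation entry point; consult the model repository on Hugging Face for the actual calling convention.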

Core Capabilities

  • Superior image understanding across 8 mainstream benchmarks
  • State-of-the-art performance in OCR and scene understanding tasks
  • Competitive performance on language benchmarks including C-EVAL and MMLU
  • Advanced audio processing with support for both Chinese and English
  • Fast inference: 6312 tokens/s prefill and 1294 tokens/s decode on an H100 GPU

Frequently Asked Questions

Q: What makes this model unique?

The model uniquely combines high performance across all three modalities (text, image, audio) while maintaining a relatively small parameter count. It achieves this through efficient architecture design and novel integration of specialized modules.

Q: What are the recommended use cases?

The model excels in multi-modal applications including image understanding, OCR, scene description, audio transcription, and general language tasks. It's particularly suited for edge deployment where resource efficiency is crucial while maintaining high performance.
