Megrez-3B-Omni

Property	Value
Total Parameters	4B
License	Apache-2.0
Model Type	Multi-modal (Text, Image, Audio)
Context Length	4K tokens
Languages	Chinese & English
Model URL	https://huggingface.co/Infinigence/Megrez-3B-Omni

What is Megrez-3B-Omni?

Megrez-3B-Omni is a multi-modal model developed by Infinigence AI that understands text, image, and audio input. It is built on Megrez-3B-Instruct and reports state-of-the-art results on multiple benchmarks at a relatively compact total size of 4B parameters.

Implementation Details

The architecture integrates three specialized modules: a Llama-2-style language model with grouped-query attention (GQA) for language processing (2.29B parameters), SigLip-SO400M for vision (0.42B parameters), and the Whisper-large-v3 encoder for audio processing (0.64B parameters). The modules are connected through cross-attention and linear layers.
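The three module sizes quoted above do not add up to the 4B total in the property table, so a quick tally may help. The sketch below sums the stated figures and reports the remainder. This section does not say where the remaining parameters sit; token embeddings, the output layer, and the connector layers are a plausible guess, not a stated fact, and the 4B figure is probably rounded. The function names are illustrative only.

```python
# Backbone parameter counts quoted in this section, in billions.
BACKBONE_PARAMS_B = {
    "language (Llama-2 with GQA)": 2.29,
    "vision (SigLip-SO400M)": 0.42,
    "audio (Whisper-large-v3 encoder)": 0.64,
}

# "Total Parameters: 4B" from the property table (likely a rounded value).
STATED_TOTAL_B = 4.0


def backbone_total(params):
    """Sum the per-module backbone sizes, rounded to two decimals."""
    return round(sum(params.values()), 2)


def unaccounted(params, stated_total):
    """Parameters (in billions) not covered by the three backbones."""
    return round(stated_total - backbone_total(params), 2)


if __name__ == "__main__":
    for name, size in BACKBONE_PARAMS_B.items():
        share = 100 * size / STATED_TOTAL_B
        print(f"{name}: {size:.2f}B ({share:.1f}% of stated total)")
    print(f"backbones combined: {backbone_total(BACKBONE_PARAMS_B):.2f}B")
    print(f"not itemized here:  {unaccounted(BACKBONE_PARAMS_B, STATED_TOTAL_B):.2f}B")
```

On the stated numbers, the backbones account for about 3.35B parameters, leaving roughly 0.65B that this section does not itemize.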

Image understanding surpasses larger models on OpenCompass, with an average score of 66.2
Language performance degrades by less than 2% compared with the text-only version
Supports real-time audio transcription and multi-turn conversations
Optimized for edge deployment

Core Capabilities

Strong image understanding across 8 mainstream benchmarks
State-of-the-art performance in OCR and scene understanding tasks
Competitive performance on language benchmarks including C-EVAL and MMLU
Advanced audio processing with support for both Chinese and English
Fast inference: 6312 tokens/s for prefill and 1294 tokens/s for decode on an H100
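To give the throughput figures above some intuition, here is a back-of-the-envelope latency estimate. It is a minimal sketch that assumes both rates are constant and apply to a single request, which the section does not confirm; published throughput numbers are often measured over batches, so real per-request latency may differ. The function name and the example token counts are hypothetical.

```python
# Throughput figures quoted in this section (H100).
PREFILL_TOKENS_PER_S = 6312.0
DECODE_TOKENS_PER_S = 1294.0


def estimate_seconds(prompt_tokens, output_tokens,
                     prefill_tps=PREFILL_TOKENS_PER_S,
                     decode_tps=DECODE_TOKENS_PER_S):
    """Rough end-to-end time: prompt processing plus token generation.

    Assumes constant throughput for one request (an idealization).
    """
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill_time = prompt_tokens / prefill_tps
    decode_time = output_tokens / decode_tps
    return prefill_time + decode_time


if __name__ == "__main__":
    # Hypothetical request: 1000-token prompt, 200-token reply.
    seconds = estimate_seconds(1000, 200)
    print(f"estimated time: {seconds:.2f} s")
```

Under these assumptions, a 1000-token prompt with a 200-token reply would take roughly a third of a second, split about evenly between prefill and decode.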

Frequently Asked Questions

Q: What makes this model unique?

The model combines strong performance across all three modalities (text, image, audio) with a relatively small parameter count. It achieves this through an efficient architecture that integrates specialized vision and audio modules with the language model.

Q: What are the recommended use cases?

The model is suited to multi-modal applications including image understanding, OCR, scene description, audio transcription, and general language tasks. It is a particularly good fit for edge deployment, where resource efficiency is crucial and performance must remain high.
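For edge deployment, a first sizing question is how much memory the weights alone need. The sketch below estimates this from the stated 4B total parameter count at a few common numeric precisions. Which precisions or quantized variants the model actually ships with is not stated here, so the precision table is an assumption. The estimate also ignores activations, the KV cache, and runtime overhead, so treat it as a lower bound.

```python
# Total parameter count from the property table ("4B").
TOTAL_PARAMS = 4e9

# Bytes per parameter for common precisions (assumed, not model-specific).
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(total_params, dtype):
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_memory_gb(TOTAL_PARAMS, dtype):.1f} GB")
```

At 16-bit precision this gives about 8 GB for the weights, which indicates the class of device the model could target before any quantization.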
