Megrez-3B-Omni
Property | Value |
---|---|
Total Parameters | 4B |
License | Apache-2.0 |
Model Type | Multi-modal (Text, Image, Audio) |
Context Length | 4K tokens |
Languages | Chinese & English |
Model URL | https://huggingface.co/Infinigence/Megrez-3B-Omni |
What is Megrez-3B-Omni?
Megrez-3B-Omni is a multi-modal model developed by Infinigence AI that understands text, image, and audio input. It is built on Megrez-3B-Instruct and reports state-of-the-art results on several image, language, and audio benchmarks while staying relatively compact at 4B total parameters.
Implementation Details
The architecture integrates three specialized modules: a Llama-2-style language model with grouped-query attention (GQA) for language processing (2.29B params), SigLip-SO400M for vision (0.42B params), and the Whisper-large-v3 encoder for audio processing (0.64B params). The encoders are connected to the language model through cross-attention and linear layers.
- Image understanding reaches an average score of 66.2 on OpenCompass, surpassing larger models
- Language performance degrades by less than 2% compared to the text-only version
- Supports real-time audio transcription and multi-turn conversations
- Optimized for edge deployment
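As a quick sanity check on the module sizes quoted above, the short sketch below adds them up and compares the sum with the 4B total from the property table. The dictionary keys and the helper name `summarize` are illustrative only. The section does not itemize the remaining parameters; attributing them to parts such as embeddings, the output head, and connector layers is an assumption.

```python
# Module sizes in billions of parameters, as quoted in this section.
MODULE_PARAMS_B = {
    "language (Llama-2 with GQA)": 2.29,
    "vision (SigLip-SO400M)": 0.42,
    "audio (Whisper-large-v3 encoder)": 0.64,
}

# Headline total from the property table (a rounded figure).
TOTAL_PARAMS_B = 4.0


def summarize(modules, total):
    """Return (sum of listed modules, remainder up to the quoted total)."""
    listed = round(sum(modules.values()), 2)
    remainder = round(total - listed, 2)
    return listed, remainder


listed, remainder = summarize(MODULE_PARAMS_B, TOTAL_PARAMS_B)
print(f"listed modules: {listed}B")
print(f"not itemized in this section: {remainder}B")
```

The three listed modules account for about 3.35B of the 4B total, so roughly 0.65B parameters sit outside the itemized modules.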
Core Capabilities
- Superior image understanding across 8 mainstream benchmarks
- State-of-the-art performance in OCR and scene understanding tasks
- Competitive performance on language benchmarks including C-EVAL and MMLU
- Advanced audio processing with support for both Chinese and English
- Inference throughput on an H100: 6312 tokens/s for prefill and 1294 tokens/s for decode
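The two throughput figures can be turned into a rough latency estimate for a single request. The sketch below is back-of-the-envelope arithmetic only: it assumes both figures apply to one request at a time and stay constant over the whole sequence, which the section does not state. The function name `estimate_latency_s` and the example request size are hypothetical.

```python
PREFILL_TOKENS_PER_S = 6312  # quoted prefill throughput on H100
DECODE_TOKENS_PER_S = 1294   # quoted decode throughput on H100


def estimate_latency_s(prompt_tokens: int, output_tokens: int) -> float:
    """Rough latency: time to prefill the prompt plus time to decode the output."""
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    prefill_s = prompt_tokens / PREFILL_TOKENS_PER_S
    decode_s = output_tokens / DECODE_TOKENS_PER_S
    return prefill_s + decode_s


# Hypothetical request: 1,000 prompt tokens and 200 generated tokens.
print(f"estimated latency: {estimate_latency_s(1000, 200):.2f} s")
```

Under these assumptions, decode time dominates quickly: each generated token costs about five times as much as each prompt token, so long outputs matter more than long prompts.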
Frequently Asked Questions
Q: What makes this model unique?
The model covers all three modalities (text, image, audio) in a single network while keeping the parameter count relatively small. It does this by pairing a compact language model with dedicated vision and audio encoders.
Q: What are the recommended use cases?
The model is intended for multi-modal applications such as image understanding, OCR, scene description, audio transcription, and general language tasks. It is particularly suited to edge deployment, where compute and memory are limited but strong performance is still required.
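For edge deployment, a useful first check is whether the weights fit in device memory. The sketch below estimates weight-only memory from the 4B parameter count at a few common numeric precisions. The precisions are illustrative assumptions, since this section does not say which formats the model is distributed in, and the estimate ignores activations, the KV cache, and runtime overhead. The helper `weight_memory_gib` is hypothetical.

```python
# Bytes needed to store one parameter at each precision (illustrative choices).
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

TOTAL_PARAMS = 4e9  # 4B total parameters, from the property table


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Weight-only memory in GiB; excludes activations and KV cache."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(TOTAL_PARAMS, dtype):.1f} GiB")
```

At 2 bytes per parameter the weights alone come to roughly 7.5 GiB, so the actual footprint on a device depends heavily on the chosen precision.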