OmniLMM-12B
| Property | Value |
|---|---|
| Model Size | 11.6B parameters |
| License | Apache-2.0 (code), custom license for parameters |
| Primary Task | Visual Question Answering |
| Architecture | EVA02-5B + Zephyr-7B-β with perceiver resampler |
What is OmniLMM-12B?
OmniLMM-12B is a state-of-the-art open-source large multimodal model that combines visual and language understanding. Built on EVA02-5B and Zephyr-7B-β, it is designed to handle complex visual-language tasks with high accuracy and reliability. The model stands out for its multimodal RLHF alignment, which makes its responses notably more trustworthy than those of comparable LMMs.
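To make this concrete, the sketch below shows what a single visual question answering call might look like. The `ask_omnilmm` helper, the base64-plus-JSON request format, and the `chat` method are illustrative assumptions about a chat-style wrapper, not the official OmniLMM API; consult the project repository for the actual entry points.

```python
# Hypothetical usage sketch: the wrapper interface, method signature, and
# message format below are illustrative assumptions, not the official API.
import base64
import json


def ask_omnilmm(chat_model, image_path: str, question: str) -> str:
    """Send one image + question turn to a chat-style OmniLMM wrapper."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    # Many open LMM repos accept a base64 image plus a JSON-encoded message list.
    request = {
        "image": image_b64,
        "question": json.dumps([{"role": "user", "content": question}]),
    }
    return chat_model.chat(request)


# chat_model = <load OmniLMM-12B via the official repository's chat wrapper>
# print(ask_omnilmm(chat_model, "photo.jpg", "What is the person holding?"))
```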
Implementation Details
The architecture connects the EVA02-5B vision encoder to the Zephyr-7B-β language model through a perceiver resampler layer, and the model is trained on multimodal data with a curriculum learning approach (a minimal sketch of the resampler idea follows the list below). It posts strong benchmark results, including leading scores on MME (1637), MMBench (71.6%), and Object HalBench (90.3% / 95.5%).
- Multimodal RLHF alignment for reduced hallucination
- Real-time processing of video and speech streams
- Curriculum-based training on diverse multimodal datasets
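As referenced above, here is a minimal sketch of the perceiver resampler idea: a fixed set of learned latent queries cross-attends to the variable-length patch features from the vision encoder and emits a compact sequence of visual tokens for the language model. All dimensions and layer counts below are illustrative placeholders, not OmniLMM-12B's actual configuration.

```python
import torch
import torch.nn as nn


class PerceiverResampler(nn.Module):
    """Minimal perceiver-style resampler: learned queries cross-attend to
    image features and emit a fixed number of visual tokens for the LLM.
    Sizes here are illustrative, not OmniLMM-12B's real configuration."""

    def __init__(self, vision_dim=1792, llm_dim=4096, num_queries=64,
                 num_heads=8, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj_in = nn.Linear(vision_dim, llm_dim)
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, image_feats):           # (B, N_patches, vision_dim)
        kv = self.proj_in(image_feats)        # project into the LLM width
        x = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        for attn in self.layers:
            out, _ = attn(query=x, key=kv, value=kv)
            x = x + out                       # residual cross-attention
        return self.norm(x)                   # (B, num_queries, llm_dim)


# Example: 256 patch embeddings from one image -> 64 visual tokens for the LLM
tokens = PerceiverResampler()(torch.randn(1, 256, 1792))
print(tokens.shape)  # torch.Size([1, 64, 4096])
```

Compressing hundreds of patch embeddings into a few dozen tokens keeps the language model's context budget manageable regardless of image resolution.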
Core Capabilities
- Strong performance on multiple multimodal benchmarks
- Trustworthy behavior with minimal hallucination
- Real-time multimodal interaction with camera and microphone inputs (see the sketch after this list)
- Outperforms GPT-4V on Object HalBench
- Rich multimodal world knowledge
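The real-time interaction capability can be pictured as a simple capture-and-query loop. The sketch below uses OpenCV to grab webcam frames and reuses the hypothetical `ask_omnilmm` helper from the earlier sketch; the project's actual streaming demo (continuous video plus speech) is more involved than this illustration.

```python
import cv2  # OpenCV for webcam capture


def interactive_loop(chat_model):
    """Grab a webcam frame, ask a question about it, print the answer.
    Reuses the hypothetical ask_omnilmm helper from the earlier sketch."""
    cap = cv2.VideoCapture(0)                    # default camera
    try:
        while True:
            question = input("Ask about the current view (blank to quit): ")
            if not question:
                break
            ok, frame = cap.read()
            if not ok:
                print("Camera read failed")
                break
            cv2.imwrite("frame.jpg", frame)      # persist the current frame
            print(ask_omnilmm(chat_model, "frame.jpg", question))
    finally:
        cap.release()
```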
Frequently Asked Questions
Q: What makes this model unique?
OmniLMM-12B is the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior, significantly reducing hallucination issues common in other models. It achieves this while maintaining competitive performance across various benchmarks.
Q: What are the recommended use cases?
The model excels in visual question answering, real-time multimodal interaction, and tasks requiring high accuracy in image understanding. It's particularly suitable for applications where factual grounding and reduced hallucination are crucial.