OmniLMM-12B

by openbmb

A powerful 12B-parameter multimodal LLM that connects EVA02-5B and Zephyr-7B-β, excelling at visual-language tasks with RLHF alignment and real-time multimodal interaction.

| Property | Value |
| --- | --- |
| Model Size | 11.6B parameters |
| License | Apache-2.0 (code), custom license for model parameters |
| Primary Task | Visual Question Answering |
| Architecture | EVA02-5B + Zephyr-7B-β with perceiver resampler |

What is OmniLMM-12B?

OmniLMM-12B is a state-of-the-art multimodal language model that combines visual and language processing. Built on EVA02-5B and Zephyr-7B-β, it is designed to handle complex visual-language tasks with high accuracy and reliability. The model stands out for its RLHF alignment, which makes its responses more trustworthy than those of many other LMMs.

Implementation Details

The architecture connects EVA02-5B and Zephyr-7B-β through a perceiver resampler layer and is trained on multimodal data with a curriculum learning approach. It achieves strong benchmark results, including MME (1637), MMBench (71.6%), and Object HalBench (90.3% / 95.5%).

  • Multimodal RLHF alignment for reduced hallucination
  • Real-time processing of video and speech streams
  • Curriculum-based training on diverse multimodal datasets
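The perceiver resampler mentioned above compresses a variable number of image-patch embeddings into a fixed, small set of visual tokens via cross-attention from learned latent queries. The sketch below is a minimal single-head illustration of that idea, not OmniLMM-12B's actual implementation; all names, dimensions, and the random weights are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(image_feats, queries, Wq, Wk, Wv):
    """Single-head cross-attention from learned queries to image features.

    image_feats: (n_patches, d)  visual-encoder patch embeddings
    queries:     (n_queries, d)  learned latent queries, n_queries << n_patches
    Returns:     (n_queries, d)  fixed-length visual token sequence
    """
    q = queries @ Wq
    k = image_feats @ Wk
    v = image_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_queries, n_patches)
    return attn @ v

rng = np.random.default_rng(0)
d = 64                                    # toy embedding width (assumption)
patches = rng.normal(size=(256, d))       # e.g. 256 patch embeddings
queries = rng.normal(size=(32, d))        # compress 256 patches -> 32 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out = perceiver_resample(patches, queries, Wq, Wk, Wv)
print(out.shape)  # (32, 64)
```

The payoff of this design is that the language model always sees the same small number of visual tokens regardless of image resolution or patch count, which keeps the cross-modal interface cheap and fixed-size.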

Core Capabilities

  • Strong performance on multiple multimodal benchmarks
  • Trustworthy behavior with minimal hallucination
  • Real-time multimodal interaction with camera and microphone inputs
  • Outperforms GPT-4V on Object HalBench
  • Rich multi-modal world knowledge understanding

Frequently Asked Questions

Q: What makes this model unique?

OmniLMM-12B is the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior, significantly reducing hallucination issues common in other models. It achieves this while maintaining competitive performance across various benchmarks.

Q: What are the recommended use cases?

The model excels at visual question answering, real-time multimodal interaction, and tasks requiring accurate image understanding. It is particularly suitable for applications where factual grounding and reduced hallucination are crucial.
