OmniLMM-12B

by openbmb

A powerful 12B-parameter multimodal LLM that connects EVA02-5B and Zephyr-7B-β, excelling at visual-language tasks with RLHF alignment and real-time multimodal interaction.

| Property | Value |
| --- | --- |
| Model Size | 11.6B parameters |
| License | Apache-2.0 (code), custom license for model parameters |
| Primary Task | Visual Question Answering |
| Architecture | EVA02-5B + Zephyr-7B-β with perceiver resampler |

What is OmniLMM-12B?

OmniLMM-12B is a state-of-the-art multimodal language model that combines visual and language processing. Built on EVA02-5B and Zephyr-7B-β, it is designed to handle complex visual-language tasks with high accuracy and reliability. The model stands out for its RLHF alignment, which makes its responses more trustworthy than those of many other LMMs.

Implementation Details

The architecture connects EVA02-5B and Zephyr-7B-β through a perceiver resampler layer and is trained on multimodal data with a curriculum learning approach. It achieves strong benchmark results, including MME (1637), MMBench (71.6%), and Object HalBench (90.3% / 95.5%).

  • Multimodal RLHF alignment for reduced hallucination
  • Real-time processing of video and speech streams
  • Curriculum-based training on diverse multimodal datasets
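The perceiver resampler mentioned above compresses a variable number of image-patch embeddings into a fixed, small set of visual tokens via cross-attention from learned latent queries. The sketch below is a minimal single-head illustration of that idea, not OmniLMM-12B's actual implementation; all names, dimensions, and the random weights are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resample(image_feats, queries, Wq, Wk, Wv):
    """Single-head cross-attention from learned queries to image features.

    image_feats: (n_patches, d)  visual-encoder patch embeddings
    queries:     (n_queries, d)  learned latent queries, n_queries << n_patches
    Returns:     (n_queries, d)  fixed-length visual token sequence
    """
    q = queries @ Wq
    k = image_feats @ Wk
    v = image_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_queries, n_patches)
    return attn @ v

rng = np.random.default_rng(0)
d = 64                                    # toy embedding width (assumption)
patches = rng.normal(size=(256, d))       # e.g. 256 patch embeddings
queries = rng.normal(size=(32, d))        # compress 256 patches -> 32 tokens
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out = perceiver_resample(patches, queries, Wq, Wk, Wv)
print(out.shape)  # (32, 64)
```

The payoff of this design is that the language model always sees the same small number of visual tokens regardless of image resolution or patch count, which keeps the cross-modal interface cheap and fixed-size.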

Core Capabilities

  • Strong performance on multiple multimodal benchmarks
  • Trustworthy behavior with minimal hallucination
  • Real-time multimodal interaction with camera and microphone inputs
  • Outperforms GPT-4V on Object HalBench
  • Rich multi-modal world knowledge understanding

Frequently Asked Questions

Q: What makes this model unique?

OmniLMM-12B is the first state-of-the-art open-source LMM aligned via multimodal RLHF for trustworthy behavior, significantly reducing hallucination issues common in other models. It achieves this while maintaining competitive performance across various benchmarks.

Q: What are the recommended use cases?

The model excels at visual question answering, real-time multimodal interaction, and tasks requiring accurate image understanding. It is particularly suitable for applications where factual grounding and reduced hallucination are crucial.
