Molmo-72B-0924

allenai

A 73.3B-parameter multimodal model from the Allen Institute for AI for image-text tasks. It posts the highest academic benchmark score among open-source models and ranks just behind GPT-4o in human evaluation.

Property           Value
Parameter Count    73.3B
Model Type         Image-Text-to-Text Multimodal
Base Architecture  Qwen2-72B + CLIP Vision
License            Apache 2.0
Paper              Research Paper

What is Molmo-72B-0924?

Molmo-72B-0924 is a multimodal vision-language model developed by the Allen Institute for AI. It is built on Qwen2-72B and uses OpenAI's CLIP as its vision backbone. Among open-source models it achieves the highest academic benchmark score, and it ranks second in human evaluation, just behind GPT-4o. The model was trained on PixMo, a curated dataset of 1 million image-text pairs.

Implementation Details

The model pairs a transformer language model (Qwen2-72B) with a CLIP vision encoder. It can run in float32 or bfloat16, trading numerical precision for lower memory use, and inference can be wrapped in PyTorch's autocast feature to run in reduced precision.
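The precision trade-off described above can be sketched in Python. This is a hedged illustration, not the official implementation: the repository id `allenai/Molmo-72B-0924` and the `processor.process` / `model.generate_from_batch` calls follow the usage pattern on the model's Hugging Face card and come from the model's remote code, so they may change. The helper `weight_memory_gb` and the function name `describe_image` are hypothetical names introduced here. The memory figure covers weights only and ignores activations and the KV cache.

```python
# Sketch only. Heavy imports are deferred into describe_image() so the
# memory helper can be run on a machine without a GPU or model weights.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}
PARAM_COUNT = 73.3e9  # parameter count quoted in this document


def weight_memory_gb(dtype: str, params: float = PARAM_COUNT) -> float:
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def describe_image(image_path: str, prompt: str = "Describe this image.") -> str:
    """Run one image and prompt through the model under bfloat16 autocast."""
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    repo = "allenai/Molmo-72B-0924"  # assumed Hugging Face repository id
    processor = AutoProcessor.from_pretrained(
        repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )

    image = Image.open(image_path).convert("RGB")
    inputs = processor.process(images=[image], text=prompt)
    # Move tensors to the model's device and add a batch dimension of 1.
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    # Autocast runs eligible ops in bfloat16 to reduce memory use.
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        output = model.generate_from_batch(
            inputs,
            GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
            tokenizer=processor.tokenizer,
        )

    # Keep only the newly generated tokens, then decode them to text.
    generated = output[0, inputs["input_ids"].size(1):]
    return processor.tokenizer.decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    for name in BYTES_PER_PARAM:
        print(f"{name}: ~{weight_memory_gb(name):.1f} GB of weights")
```

Halving the bytes per parameter halves the weight footprint, which is why bfloat16 is the usual choice when GPU memory is the limiting factor.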

  • Accepts common image formats; non-RGB images are converted to RGB before processing
  • Handles images with transparent backgrounds
  • Offers flexible memory optimization options
  • Implements efficient batch processing for multiple inputs
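The transparent-image handling in the list above can be illustrated with a small pure-Python sketch. It assumes one plausible strategy: pick a black or white background depending on the average brightness of the visible content, then alpha-composite each pixel onto it. The function names and the 127 brightness threshold are assumptions made for illustration. A real pipeline would normally use an imaging library rather than lists of tuples.

```python
from typing import Iterable, List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]


def luminance(r: int, g: int, b: int) -> float:
    """Perceived brightness using the ITU-R BT.601 luma weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def pick_background(pixels: Iterable[RGBA]) -> RGB:
    """Use black behind bright content and white behind dark content."""
    visible = [(r, g, b) for r, g, b, a in pixels if a > 0]
    if not visible:
        return (255, 255, 255)  # fully transparent image: default to white
    mean = sum(luminance(*p) for p in visible) / len(visible)
    return (0, 0, 0) if mean > 127 else (255, 255, 255)


def flatten(pixels: List[RGBA]) -> List[RGB]:
    """Alpha-composite RGBA pixels onto the chosen opaque background."""
    bg = pick_background(pixels)
    flat: List[RGB] = []
    for r, g, b, a in pixels:
        alpha = a / 255
        flat.append(
            tuple(round(c * alpha + k * (1 - alpha)) for c, k in zip((r, g, b), bg))
        )
    return flat


if __name__ == "__main__":
    # One opaque black pixel and one fully transparent pixel.
    print(flatten([(0, 0, 0, 255), (0, 0, 0, 0)]))
```

Choosing the background by contrast keeps dark line art readable on white and light artwork readable on black once the alpha channel is dropped.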

Core Capabilities

  • Achieves 81.2% average score on 11 academic benchmarks
  • Earns an Elo rating of 1077 in human preference evaluations
  • Handles complex image-text understanding tasks
  • Generates free-form text; the published usage example caps output at 200 new tokens
  • Processes high-resolution images as 336px crops for the CLIP vision encoder
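To give the 1077 Elo rating some intuition, the sketch below converts a rating gap into an expected preference rate. It assumes the standard Elo logistic formula with a 400-point scale. The source does not state how the rating was computed, and the opponent ratings used here are hypothetical, not measured results.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    molmo = 1077  # rating quoted in this document
    for opponent in (977, 1077, 1177):  # hypothetical opponent ratings
        print(f"vs {opponent}: {expected_score(molmo, opponent):.3f}")
```

Under this model, equal ratings imply a 50% preference rate, and the two models' expected scores always sum to 1.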

Frequently Asked Questions

Q: What makes this model unique?

Molmo-72B-0924 stands out for its results on academic benchmarks and in human evaluations, where it surpasses many closed-source alternatives while remaining openly available. It pairs the Qwen2-72B language model with a CLIP vision encoder, which makes it particularly effective for multimodal tasks.

Q: What are the recommended use cases?

The model excels in image description, visual question answering, and complex multimodal reasoning tasks. It's particularly well-suited for research and educational applications, with specific strengths in handling charts, infographics, and detailed visual analysis.
