
Maintained By
HuggingFaceM4

IDEFICS-80B

| Property | Value |
| --- | --- |
| Parameter Count | 80 Billion |
| Model Type | Multimodal (Image + Text) |
| Architecture | Based on the Flamingo architecture, combining CLIP and LLaMA |
| License | MIT (new weights) + LLaMA license |
| Training Data | OBELICS, Wikipedia, LAION, PMD |

What is IDEFICS-80B?

IDEFICS-80B is an open-access multimodal model that reproduces DeepMind's closed-source Flamingo. It accepts arbitrary sequences of images and text, enabling sophisticated interactions between visual and textual content. The model handles tasks such as visual question answering, image captioning, and generating stories grounded in multiple images.

Implementation Details

Built on two foundation models, CLIP ViT-H-14 for vision and LLaMA-65B for language, IDEFICS-80B introduces newly initialized parameters to bridge them: a Perceiver Resampler that compresses visual features into a fixed number of tokens, and cross-attention blocks interleaved with the language model's Transformer layers.
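The bridging idea can be illustrated with a minimal single-head cross-attention sketch in NumPy: queries come from the text tokens, while keys and values come from the (resampled) visual tokens. This is an illustrative toy, not the model's actual implementation; all dimensions and weight matrices here are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, image_feats, Wq, Wk, Wv):
    """Text tokens attend over visual tokens (single head, no masking)."""
    q = text_h @ Wq          # queries from the language stream
    k = image_feats @ Wk     # keys from the visual tokens
    v = image_feats @ Wv     # values from the visual tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
text_h = rng.standard_normal((5, d))        # 5 text token states (toy size)
image_feats = rng.standard_normal((64, d))  # 64 resampled visual tokens (toy size)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = cross_attention(text_h, image_feats, Wq, Wk, Wv)
print(out.shape)  # (5, 16): one visually-informed vector per text token
```

In the real model these cross-attention blocks are inserted between the frozen LLaMA layers, so the language stream can repeatedly consult the visual tokens produced by the Perceiver Resampler.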

  • Training utilized 512 NVIDIA A100 GPUs over 28 days
  • Implements mixed-precision BF16 training with DeepSpeed ZeRO-3
  • Trained on a diverse dataset including OBELICS (73.85%), LAION (17.18%), Wikipedia (6.15%), and PMD (2.82%)
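The dataset mixture above can be sketched as a weighted sampling scheme. This is a hypothetical illustration of how such proportions might drive a data loader, not the actual training pipeline:

```python
import random

# Per-source sampling proportions as reported for IDEFICS-80B training.
# The sampling code itself is illustrative, not the real loader.
mixture = {
    "OBELICS": 0.7385,
    "LAION": 0.1718,
    "Wikipedia": 0.0615,
    "PMD": 0.0282,
}

# The proportions should cover the full training mix.
assert abs(sum(mixture.values()) - 1.0) < 1e-6

random.seed(0)
# Draw the source for each example in a toy batch of 8.
batch_sources = random.choices(
    list(mixture), weights=list(mixture.values()), k=8
)
print(batch_sources)
```

With OBELICS at roughly 74%, most sampled examples come from interleaved web documents, which matches the model's emphasis on mixed image-text sequences.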

Core Capabilities

  • Visual Question Answering with 63.6% accuracy on VQAv2 (4-shot)
  • Image Captioning with strong CIDEr scores (110.3 on COCO)
  • Multi-image story generation and description
  • Zero-shot and few-shot learning across various visual tasks
  • Pure language model functionality without visual inputs
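Few-shot use relies on interleaving images and text in a single prompt. The sketch below builds such a prompt; the image URLs are hypothetical placeholders, and the (commented-out) loading code follows the Transformers IDEFICS API under the assumption of the `HuggingFaceM4/idefics-80b` checkpoint and sufficient GPU memory:

```python
# A one-shot VQA prompt: the processor accepts lists that interleave
# image URLs (or PIL images) with text strings.
prompts = [
    [
        "https://example.com/dog.jpg",  # hypothetical image URL
        "Question: What animal is shown? Answer: a dog.",
        "https://example.com/cat.jpg",  # hypothetical image URL
        "Question: What animal is shown? Answer:",
    ]
]

# Loading and generation would look roughly like this; it is commented out
# because the 80B checkpoint needs substantial GPU resources:
# import torch
# from transformers import IdeficsForVisionText2Text, AutoProcessor
# processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-80b")
# model = IdeficsForVisionText2Text.from_pretrained(
#     "HuggingFaceM4/idefics-80b",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )
# inputs = processor(prompts, return_tensors="pt").to(model.device)
# generated = model.generate(**inputs, max_new_tokens=20)
# print(processor.batch_decode(generated, skip_special_tokens=True))

print(len(prompts[0]))  # 4 interleaved elements (image, text, image, text)
```

The in-context example (first image plus answered question) is what enables the few-shot behavior reported in the benchmark numbers above.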

Frequently Asked Questions

Q: What makes this model unique?

IDEFICS-80B is one of the largest open-access multimodal models available, matching the performance of closed-source alternatives while being built entirely on publicly available data and models.

Q: What are the recommended use cases?

The model excels at visual question answering, image captioning, and creating narratives from multiple images. It's particularly suited for research and development in multimodal AI applications, though it should not be used for critical decision-making or high-stakes scenarios.
