IDEFICS-80B
Property | Value |
---|---|
Parameter Count | 80 Billion |
Model Type | Multimodal (Image + Text) |
Architecture | Based on Flamingo architecture with CLIP and LLaMA |
License | MIT (new weights) + LLaMA License |
Training Data | OBELICS, Wikipedia, LAION, PMD |
What is IDEFICS-80B?
IDEFICS-80B is an open-access multimodal model that reproduces DeepMind's Flamingo. It accepts interleaved sequences of images and text as input and generates text as output. Typical tasks include visual question answering, image captioning, and writing stories grounded in multiple images.
Implementation Details
IDEFICS-80B is built on two pretrained foundation models: CLIP ViT-H-14 for vision and LLaMA-65B for language. Newly initialized parameters connect the two. These take the form of Transformer blocks that include a Perceiver Resampler and cross-attention layers, through which the language model attends to the visual features.
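To make the cross-attention bridge more concrete, here is a minimal pure-Python sketch in which text states act as queries over visual states. It is an illustration under stated assumptions, not the IDEFICS implementation: it uses a single head with no learned projections, and the tanh gate follows the published Flamingo design rather than anything stated on this page. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, visual_states):
    """Each text vector (query) attends over all visual vectors (keys and values).

    Simplification: no learned query/key/value projections, single head.
    """
    dim = len(text_states[0])
    attended = []
    for query in text_states:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_states
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, visual_states))
             for j in range(dim)]
        )
    return attended


def gated_cross_attention(text_states, visual_states, gate):
    """Residual update: text + tanh(gate) * attention(text, visual).

    Assumption (Flamingo-style): with gate == 0 the layer is the identity,
    so newly added layers do not disturb the pretrained language model at
    the start of training.
    """
    attended = cross_attention(text_states, visual_states)
    scale = math.tanh(gate)
    return [
        [t + scale * a for t, a in zip(text_row, att_row)]
        for text_row, att_row in zip(text_states, attended)
    ]
```

With `gate=0.0` the text states pass through unchanged. As the gate moves away from zero, more visual information is mixed into each text position.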
- Trained on 512 NVIDIA A100 GPUs for about 28 days
- Used BF16 mixed-precision training with DeepSpeed ZeRO-3
- Trained on a data mixture of OBELICS (73.85%), LAION (17.18%), Wikipedia (6.15%), and PMD (2.82%)
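The training figures above can be turned into quick arithmetic. The sketch below computes the implied GPU-hours and draws data sources in proportion to the listed percentages. Treating the percentages as per-sample draw probabilities is an assumption made for illustration; the page does not describe how the mixture was actually scheduled. The helper name `sample_sources` is hypothetical.

```python
import random

GPUS = 512
DAYS = 28
# 512 GPUs * 28 days * 24 hours = 344,064 A100 GPU-hours
gpu_hours = GPUS * DAYS * 24

# Mixture shares in percent, as listed above.
MIXTURE = {"OBELICS": 73.85, "LAION": 17.18, "Wikipedia": 6.15, "PMD": 2.82}

# Sanity check: the listed shares should cover the whole mixture.
assert abs(sum(MIXTURE.values()) - 100.0) < 1e-9


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their share."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Over a large number of draws, roughly three out of four samples come from OBELICS, which matches its dominant share of the mixture.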
Core Capabilities
- Visual Question Answering with 63.6% accuracy on VQAv2 (4-shot)
- Image Captioning with a CIDEr score of 110.3 on COCO
- Multi-image story generation and description
- Zero-shot and few-shot learning across various visual tasks
- Pure language model functionality without visual inputs
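Few-shot use amounts to placing solved examples before the query in one interleaved prompt. The sketch below builds such a prompt and, separately, shows how it might be run through the Hugging Face `transformers` IDEFICS classes. Several parts are assumptions: the helper names, the "Question/Answer" template, and the checkpoint id `HuggingFaceM4/idefics-80b` are illustrative, and the generation function needs `torch`, `transformers`, `accelerate`, and enough GPU memory for an 80B-parameter model.

```python
def build_few_shot_prompt(shots, query_image, question):
    """Interleave solved (image, question, answer) examples with one open query.

    `shots` is a list of (image, question, answer) tuples; an image may be a
    URL string or a PIL image. The template text is an illustrative choice.
    """
    prompt = []
    for image, shot_question, shot_answer in shots:
        prompt.append(image)
        prompt.append(f"Question: {shot_question} Answer: {shot_answer}\n")
    prompt.append(query_image)
    prompt.append(f"Question: {question} Answer:")
    return prompt


def generate(prompt, checkpoint="HuggingFaceM4/idefics-80b", max_new_tokens=50):
    """Run one interleaved prompt through the model and return the decoded text."""
    # Imported lazily so the prompt builder works without these packages.
    import torch
    from transformers import AutoProcessor, IdeficsForVisionText2Text

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = IdeficsForVisionText2Text.from_pretrained(
        checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = processor([prompt], return_tensors="pt").to(model.device)
    # Stop the model from emitting image placeholder tokens in its output.
    bad_words_ids = processor.tokenizer(
        ["<image>", "<fake_token_around_image>"], add_special_tokens=False
    ).input_ids
    generated_ids = model.generate(
        **inputs, bad_words_ids=bad_words_ids, max_new_tokens=max_new_tokens
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A one-shot prompt would be built as `build_few_shot_prompt([("https://example.com/cat.jpg", "What animal is this?", "A cat")], "https://example.com/dog.jpg", "What animal is this?")`, where both URLs are placeholders. Passing an empty `shots` list gives the zero-shot case.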
Frequently Asked Questions
Q: What makes this model unique?
IDEFICS-80B is among the largest open-access multimodal models. It is built entirely on publicly available data and pretrained models, and its benchmark performance is reported as comparable to that of the closed-source Flamingo model it reproduces.
Q: What are the recommended use cases?
The model excels at visual question answering, image captioning, and creating narratives from multiple images. It's particularly suited for research and development in multimodal AI applications, though it should not be used for critical decision-making or high-stakes scenarios.