
Maintained By
HuggingFaceM4

IDEFICS-80B

| Property | Value |
| --- | --- |
| Parameter Count | 80 Billion |
| Model Type | Multimodal (Image + Text) |
| Architecture | Based on the Flamingo architecture, combining CLIP and LLaMA |
| License | MIT (new weights) + LLaMA license |
| Training Data | OBELICS, Wikipedia, LAION, PMD |

What is IDEFICS-80B?

IDEFICS-80B is an open-access multimodal model that reproduces DeepMind's closed-source Flamingo. It accepts arbitrary sequences of images and text, enabling sophisticated interactions between visual and textual content. The model handles tasks such as visual question answering, image captioning, and generating stories grounded in multiple images.

Implementation Details

Built on two foundation models, CLIP ViT-H-14 for vision and LLaMA-65B for language, IDEFICS-80B introduces newly initialized parameters to bridge them: a Perceiver Resampler that compresses visual features into a fixed number of tokens, and cross-attention blocks interleaved with the language model's Transformer layers.
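The bridging idea can be illustrated with a minimal single-head cross-attention sketch in NumPy: queries come from the text tokens, while keys and values come from the (resampled) visual tokens. This is an illustrative toy, not the model's actual implementation; all dimensions and weight matrices here are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, image_feats, Wq, Wk, Wv):
    """Text tokens attend over visual tokens (single head, no masking)."""
    q = text_h @ Wq          # queries from the language stream
    k = image_feats @ Wk     # keys from the visual tokens
    v = image_feats @ Wv     # values from the visual tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
text_h = rng.standard_normal((5, d))        # 5 text token states (toy size)
image_feats = rng.standard_normal((64, d))  # 64 resampled visual tokens (toy size)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out = cross_attention(text_h, image_feats, Wq, Wk, Wv)
print(out.shape)  # (5, 16): one visually-informed vector per text token
```

In the real model these cross-attention blocks are inserted between the frozen LLaMA layers, so the language stream can repeatedly consult the visual tokens produced by the Perceiver Resampler.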

  • Training utilized 512 NVIDIA A100 GPUs over 28 days
  • Implements mixed-precision BF16 training with DeepSpeed ZeRO-3
  • Trained on a diverse dataset including OBELICS (73.85%), LAION (17.18%), Wikipedia (6.15%), and PMD (2.82%)
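The dataset mixture above can be sketched as a weighted sampling scheme. This is a hypothetical illustration of how such proportions might drive a data loader, not the actual training pipeline:

```python
import random

# Per-source sampling proportions as reported for IDEFICS-80B training.
# The sampling code itself is illustrative, not the real loader.
mixture = {
    "OBELICS": 0.7385,
    "LAION": 0.1718,
    "Wikipedia": 0.0615,
    "PMD": 0.0282,
}

# The proportions should cover the full training mix.
assert abs(sum(mixture.values()) - 1.0) < 1e-6

random.seed(0)
# Draw the source for each example in a toy batch of 8.
batch_sources = random.choices(
    list(mixture), weights=list(mixture.values()), k=8
)
print(batch_sources)
```

With OBELICS at roughly 74%, most sampled examples come from interleaved web documents, which matches the model's emphasis on mixed image-text sequences.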

Core Capabilities

  • Visual Question Answering with 63.6% accuracy on VQAv2 (4-shot)
  • Image Captioning with strong CIDEr scores (110.3 on COCO)
  • Multi-image story generation and description
  • Zero-shot and few-shot learning across various visual tasks
  • Pure language model functionality without visual inputs
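Few-shot use relies on interleaving images and text in a single prompt. The sketch below builds such a prompt; the image URLs are hypothetical placeholders, and the (commented-out) loading code follows the Transformers IDEFICS API under the assumption of the `HuggingFaceM4/idefics-80b` checkpoint and sufficient GPU memory:

```python
# A one-shot VQA prompt: the processor accepts lists that interleave
# image URLs (or PIL images) with text strings.
prompts = [
    [
        "https://example.com/dog.jpg",  # hypothetical image URL
        "Question: What animal is shown? Answer: a dog.",
        "https://example.com/cat.jpg",  # hypothetical image URL
        "Question: What animal is shown? Answer:",
    ]
]

# Loading and generation would look roughly like this; it is commented out
# because the 80B checkpoint needs substantial GPU resources:
# import torch
# from transformers import IdeficsForVisionText2Text, AutoProcessor
# processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics-80b")
# model = IdeficsForVisionText2Text.from_pretrained(
#     "HuggingFaceM4/idefics-80b",
#     torch_dtype=torch.bfloat16,
#     device_map="auto",
# )
# inputs = processor(prompts, return_tensors="pt").to(model.device)
# generated = model.generate(**inputs, max_new_tokens=20)
# print(processor.batch_decode(generated, skip_special_tokens=True))

print(len(prompts[0]))  # 4 interleaved elements (image, text, image, text)
```

The in-context example (first image plus answered question) is what enables the few-shot behavior reported in the benchmark numbers above.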

Frequently Asked Questions

Q: What makes this model unique?

IDEFICS-80B is one of the largest open-access multimodal models available, matching the performance of closed-source alternatives while being built entirely on publicly available data and models.

Q: What are the recommended use cases?

The model excels at visual question answering, image captioning, and creating narratives from multiple images. It's particularly suited for research and development in multimodal AI applications, though it should not be used for critical decision-making or high-stakes scenarios.
