bridgetower-large-itm-mlm-itc

Maintained By
BridgeTower

BridgeTower Large ITM-MLM-ITC Model

Property | Value
License | MIT
Paper | View Paper
Training Datasets | 5 (CC3M, CC12M, SBU, MSCOCO, Visual Genome)
Framework | PyTorch

What is bridgetower-large-itm-mlm-itc?

BridgeTower is a vision-language model whose architecture adds multiple bridge layers connecting the uni-modal encoders to the cross-modal encoder. The BridgeTower paper reports 78.73% accuracy on the VQAv2 test-std set with only 4M images of pre-training data. This checkpoint is the large variant, pre-trained on 14M images with image-text matching (ITM), masked language modeling (MLM), and image-text contrastive (ITC) objectives.

Implementation Details

The model is implemented in PyTorch and supports three main functionalities: contrastive learning between image-text pairs, image-text matching, and masked language modeling. It was pre-trained for 10 epochs on 512 Gaudis and 128 Xeons with a batch size of 2048.

  • Uses the AdamW optimizer with a learning rate of 1e-7
  • Image resolution: 294x294 pixels
  • Pre-trained on 14M unique images across 5 datasets
  • Implements bridge layers for effective bottom-up cross-modal alignment
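
The bridge layer idea can be illustrated with a minimal sketch. It assumes the simple add-then-LayerNorm fusion described in the BridgeTower paper and leaves out the learned scale and shift parameters; the function names and the four-dimensional toy vectors are hypothetical and exist only for illustration, not as the model's actual code.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and (nearly) unit variance.

    Assumption: no learned scale/shift, unlike a trained LayerNorm.
    """
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def bridge_layer(cross_modal_prev, uni_modal):
    """Fuse the previous cross-modal output with a uni-modal layer output.

    Both arguments are hidden vectors for a single token and must share
    the same hidden size. The fusion is element-wise add, then normalize.
    """
    if len(cross_modal_prev) != len(uni_modal):
        raise ValueError("both inputs must share the same hidden size")
    summed = [c + u for c, u in zip(cross_modal_prev, uni_modal)]
    return layer_norm(summed)


# Toy example: one token with a hidden size of 4 (made-up numbers).
fused = bridge_layer([0.5, -1.0, 2.0, 0.0], [1.5, 1.0, -2.0, 4.0])
```

In this reading, one such fusion runs before each cross-modal layer, so uni-modal features from several depths reach the cross-modal encoder rather than only the final layer's output.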

Core Capabilities

  • Contrastive Learning between image and text pairs
  • Image and Text Matching
  • Masked Language Modeling
  • Cross-modal representation learning
  • Visual Question Answering (after task-specific fine-tuning)
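
To make the contrastive learning capability concrete, here is a small sketch of a symmetric image-text contrastive loss in plain Python. It is a generic InfoNCE-style formulation, not the model's confirmed training code; the temperature value of 0.07, the function names, and the two-dimensional toy embeddings are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[target]


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair k is (image_embs[k], text_embs[k]).

    Assumption: temperature is a fixed hypothetical value, not a learned one.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # sim[i][j] = scaled cosine similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
           for img in images]
    n = len(images)
    # Image-to-text: each image should pick its own caption.
    i2t = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick its own image.
    t2i = sum(cross_entropy([sim[r][k] for r in range(n)], k) for k in range(n)) / n
    return (i2t + t2i) / 2


# Toy embeddings: matched pairs give a lower loss than swapped pairs.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.2, 0.8], [0.9, 0.1]])
```

The loss is low when each image is most similar to its own caption and high when the pairing is swapped, which is the signal contrastive pre-training optimizes.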

Frequently Asked Questions

Q: What makes this model unique?

BridgeTower's distinguishing feature is its bridge layers, which enable bottom-up cross-modal alignment between visual and textual representations at different semantic levels. This lets it reach strong results, such as 78.73% on the VQAv2 test-std set, with only 4M images of pre-training data.

Q: What are the recommended use cases?

The model suits vision-language tasks including image-text matching, visual question answering, and masked language modeling with visual context. It is most useful for applications that need to relate visual and textual content closely.
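
For image-text matching, a usage sketch with the Hugging Face transformers library might look like the following. It assumes the checkpoint is published on the Hub as BridgeTower/bridgetower-large-itm-mlm-itc, that torch, transformers, and Pillow are installed, and that index 1 of the two output logits is the "match" class, as in the library's BridgeTower examples. The helper names and the scores in the last line are made up for illustration.

```python
def rank_captions(scores):
    """Sort (caption, score) pairs from best match to worst."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def score_captions(image, captions,
                   checkpoint="BridgeTower/bridgetower-large-itm-mlm-itc"):
    """Return {caption: match logit} for one PIL image.

    Assumption: `checkpoint` is the Hub id of this model. Calling this
    function downloads the weights, which are large.
    """
    # Imported lazily because these are heavy, optional dependencies.
    import torch
    from transformers import (BridgeTowerForImageAndTextRetrieval,
                              BridgeTowerProcessor)

    processor = BridgeTowerProcessor.from_pretrained(checkpoint)
    model = BridgeTowerForImageAndTextRetrieval.from_pretrained(checkpoint)
    model.eval()

    scores = {}
    with torch.no_grad():
        for caption in captions:
            # The processor tokenizes the text and preprocesses the image.
            encoding = processor(image, caption, return_tensors="pt")
            outputs = model(**encoding)
            # logits has shape (1, 2); index 1 is taken as the match class.
            scores[caption] = outputs.logits[0, 1].item()
    return scores


# Ranking works on any caption -> score mapping; these scores are made up.
example_ranking = rank_captions({"two cats on a couch": 4.2,
                                 "a football match": -1.3})
```

A typical call would open an image with PIL.Image.open, pass it with a list of candidate captions to score_captions, and feed the result to rank_captions. Scoring one caption at a time keeps the sketch simple; batching the captions would be faster for long lists.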
