bridgetower-large-itm-mlm-itc

BridgeTower

BridgeTower vision-language model with state-of-the-art performance on VQAv2. Features innovative bridge layers for cross-modal alignment. MIT licensed.

| Property | Value |
|----------|-------|
| License | MIT |
| Paper | BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning |
| Training Datasets | 5 (CC3M, CC12M, SBU, MSCOCO, Visual Genome) |
| Framework | PyTorch |

What is bridgetower-large-itm-mlm-itc?

BridgeTower is a vision-language model built around multiple bridge layers that connect the top layers of the uni-modal (vision and text) encoders with each layer of the cross-modal encoder. The model achieves state-of-the-art performance on a range of vision-language tasks, notably reaching 78.73% accuracy on the VQAv2 test-std set with only 4M images of pre-training data.

Implementation Details

The model is implemented in PyTorch and supports three main pre-training objectives: image-text contrastive learning (ITC), image-text matching (ITM), and masked language modeling (MLM). It was pre-trained for 10 epochs with a batch size of 2048 on 512 Habana Gaudi accelerators and 128 Intel Xeon processors.

  • Utilizes AdamW optimizer with 1e-7 learning rate
  • Image resolution: 294x294 pixels
  • Pre-trained on 14M unique images across 5 datasets
  • Implements bridge layers for effective bottom-up cross-modal alignment
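As a minimal sketch, the checkpoint can be loaded through the BridgeTower classes in Hugging Face `transformers` and used for image-text matching (assuming `transformers`, `torch`, `Pillow`, and `requests` are installed; the COCO image URL and candidate captions are illustrative inputs only):

```python
from typing import Dict, List

def rank_captions(match_scores: Dict[str, float]) -> List[str]:
    """Return candidate captions sorted by image-text matching score, best first."""
    return sorted(match_scores, key=match_scores.get, reverse=True)

if __name__ == "__main__":
    # The large model download happens only when run as a script.
    import requests
    from PIL import Image
    from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

    ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
    processor = BridgeTowerProcessor.from_pretrained(ckpt)
    model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    captions = ["two cats lying on a couch", "a jet plane on a runway"]

    scores = {}
    for text in captions:
        inputs = processor(image, text, return_tensors="pt")
        # logits[0, 1] is the "match" logit for this image-text pair
        scores[text] = model(**inputs).logits[0, 1].item()

    print(rank_captions(scores))
```

Scoring each caption independently and ranking the results is the usual way to use the ITM head for retrieval-style tasks.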

Core Capabilities

  • Contrastive Learning between image and text pairs
  • Image and Text Matching
  • Masked Language Modeling
  • Cross-modal representation learning
  • Visual Question Answering
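For the contrastive (ITC) capability, the model produces separate projected image and text embeddings that can be compared with cosine similarity. The sketch below uses a plain-Python similarity helper; the `BridgeTowerForContrastiveLearning` class and `image_embeds`/`text_embeds` output fields are from the transformers BridgeTower API, and the image URL is illustrative:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    import requests
    from PIL import Image
    from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

    ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
    processor = BridgeTowerProcessor.from_pretrained(ckpt)
    model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(image, "two cats lying on a couch", return_tensors="pt")
    out = model(**inputs)

    # Compare the projected image and text embeddings
    sim = cosine_similarity(out.image_embeds[0].tolist(), out.text_embeds[0].tolist())
    print(f"image-text similarity: {sim:.3f}")
```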

Frequently Asked Questions

Q: What makes this model unique?

BridgeTower's uniqueness lies in its bridge layers architecture that enables effective bottom-up cross-modal alignment between visual and textual representations at different semantic levels, achieving superior performance with significantly less pre-training data than competitors.

Q: What are the recommended use cases?

The model is ideal for vision-language tasks including image-text matching, visual question answering, and masked language modeling with visual context. It's particularly effective for applications requiring deep understanding of relationships between visual and textual content.
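Masked language modeling with visual context can be sketched as follows, using the `BridgeTowerForMaskedLM` class from transformers (the `<mask>` token follows BridgeTower's RoBERTa-style tokenizer; the image URL and prompt are illustrative, and the top-k helper is a hypothetical convenience, not part of the library):

```python
from typing import List, Sequence

def top_k_indices(logits: Sequence[float], k: int = 5) -> List[int]:
    """Indices of the k highest-scoring vocabulary entries."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

if __name__ == "__main__":
    import requests
    from PIL import Image
    from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM

    ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
    processor = BridgeTowerProcessor.from_pretrained(ckpt)
    model = BridgeTowerForMaskedLM.from_pretrained(ckpt)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    inputs = processor(image, "a <mask> lying on a couch", return_tensors="pt")
    logits = model(**inputs).logits[0]

    # Locate the masked position and decode the top predictions for it
    mask_pos = (inputs.input_ids[0] == processor.tokenizer.mask_token_id).nonzero()[0].item()
    for tok_id in top_k_indices(logits[mask_pos].tolist(), k=3):
        print(processor.tokenizer.decode([tok_id]))
```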
