BridgeTower Large ITM-MLM-ITC Model
| Property | Value |
|---|---|
| License | MIT |
| Paper | [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) |
| Training Datasets | 5 (CC3M, CC12M, SBU, MSCOCO, Visual Genome) |
| Framework | PyTorch |
What is bridgetower-large-itm-mlm-itc?
BridgeTower is a vision-language model whose architecture introduces multiple bridge layers that connect the top layers of the uni-modal encoders to each layer of the cross-modal encoder. The model achieves state-of-the-art performance on a range of vision-language tasks, notably reaching 78.73% accuracy on the VQAv2 test-std set with only 4M images of pre-training data.
Implementation Details
The model is implemented in PyTorch and supports three main functionalities: contrastive learning between image-text pairs (ITC), image-text matching (ITM), and masked language modeling (MLM). It was pre-trained for 10 epochs at a batch size of 2048 on 512 Gaudi accelerators and 128 Xeons; a minimal loading and scoring sketch follows the list below.
- Uses the AdamW optimizer with a learning rate of 1e-7
- Image resolution: 294×294 pixels
- Pre-trained on 14M unique images across 5 datasets
- Implements bridge layers for effective bottom-up cross-modal alignment
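As a concrete starting point, here is a minimal sketch of loading this checkpoint through the Hugging Face `transformers` BridgeTower classes and scoring captions with the contrastive (ITC) head. It assumes a recent `transformers` release with BridgeTower support; the example image URL and captions are placeholders to swap for your own data.

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

# Placeholder example data; substitute your own image and captions
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a plane flying in the sky"]

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

with torch.no_grad():
    for text in texts:
        inputs = processor(image, text, return_tensors="pt")
        out = model(**inputs)
        # The ITC head exposes projected uni-modal embeddings; their
        # similarity scores how well the caption fits the image.
        score = torch.nn.functional.cosine_similarity(
            out.image_embeds, out.text_embeds, dim=-1).item()
        print(f"{text!r}: {score:.3f}")
```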
Core Capabilities
- Contrastive Learning between image and text pairs
- Image and Text Matching (see the scoring sketch after this list)
- Masked Language Modeling
- Cross-modal representation learning
- Visual Question Answering
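The matching capability listed above is exposed through `transformers`' `BridgeTowerForImageAndTextRetrieval` class, which classifies each image-caption pair as match or no-match. A minimal sketch with the same checkpoint and a placeholder image:

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a crowd at a concert"]

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

with torch.no_grad():
    for text in texts:
        encoding = processor(image, text, return_tensors="pt")
        logits = model(**encoding).logits  # shape (1, 2): [no-match, match]
        print(f"{text!r}: match prob {logits.softmax(dim=1)[0, 1].item():.3f}")
```

For retrieval, the same loop can rank a candidate set of captions per image (or images per caption) by this match probability.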
Frequently Asked Questions
Q: What makes this model unique?
BridgeTower's bridge-layer architecture enables effective bottom-up cross-modal alignment between visual and textual representations at different semantic levels, which lets it achieve superior performance with significantly less pre-training data than comparable models.
Q: What are the recommended use cases?
The model is ideal for vision-language tasks including image-text matching, visual question answering, and masked language modeling with visual context. It's particularly effective for applications requiring deep understanding of relationships between visual and textual content.
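For masked language modeling with visual context, a minimal sketch using `transformers`' `BridgeTowerForMaskedLM` might look like the following. The `<mask>` token matches the RoBERTa-style tokenizer BridgeTower uses; the image URL and caption are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
text = "a <mask> looking out of the window"  # <mask> is the model's mask token

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForMaskedLM.from_pretrained(ckpt)

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # (1, seq_len, vocab_size)

# Greedily decode the highest-scoring token at each position; the image
# conditions the prediction for the masked word.
print(processor.decode(logits.argmax(dim=-1).squeeze(0).tolist()))
```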