BridgeTower Large ITM-MLM-ITC Model
| Property | Value |
|---|---|
| License | MIT |
| Paper | [BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning](https://arxiv.org/abs/2206.08657) |
| Training Datasets | 5 (CC3M, CC12M, SBU, MSCOCO, Visual Genome) |
| Framework | PyTorch |
What is bridgetower-large-itm-mlm-itc?
BridgeTower is a vision-language model whose architecture introduces multiple bridge layers that connect the top layers of the uni-modal encoders to each layer of the cross-modal encoder. The model achieves state-of-the-art performance on a range of vision-language tasks, notably reaching 78.73% accuracy on the VQAv2 test-std set with only 4M images of pre-training data.
Implementation Details
The model is implemented in PyTorch and supports three main functionalities: contrastive learning between image-text pairs (ITC), image-text matching (ITM), and masked language modeling (MLM). It was pre-trained for 10 epochs at a batch size of 2048 on 512 Gaudi accelerators and 128 Xeons; a minimal loading and scoring sketch follows the list below.
- Uses the AdamW optimizer with a learning rate of 1e-7
- Image resolution: 294×294 pixels
- Pre-trained on 14M unique images across 5 datasets
- Implements bridge layers for effective bottom-up cross-modal alignment
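As a concrete starting point, here is a minimal sketch of loading this checkpoint through the Hugging Face `transformers` BridgeTower classes and scoring captions with the contrastive (ITC) head. It assumes a recent `transformers` release with BridgeTower support; the example image URL and captions are placeholders to swap for your own data.

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

# Placeholder example data; substitute your own image and captions
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a plane flying in the sky"]

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

with torch.no_grad():
    for text in texts:
        inputs = processor(image, text, return_tensors="pt")
        out = model(**inputs)
        # The ITC head exposes projected uni-modal embeddings; their
        # similarity scores how well the caption fits the image.
        score = torch.nn.functional.cosine_similarity(
            out.image_embeds, out.text_embeds, dim=-1).item()
        print(f"{text!r}: {score:.3f}")
```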
Core Capabilities
- Contrastive Learning between image and text pairs
- Image and Text Matching (see the scoring sketch after this list)
- Masked Language Modeling
- Cross-modal representation learning
- Visual Question Answering
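The matching capability listed above is exposed through `transformers`' `BridgeTowerForImageAndTextRetrieval` class, which classifies each image-caption pair as match or no-match. A minimal sketch with the same checkpoint and a placeholder image:

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["two cats sleeping on a couch", "a crowd at a concert"]

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForImageAndTextRetrieval.from_pretrained(ckpt)

with torch.no_grad():
    for text in texts:
        encoding = processor(image, text, return_tensors="pt")
        logits = model(**encoding).logits  # shape (1, 2): [no-match, match]
        print(f"{text!r}: match prob {logits.softmax(dim=1)[0, 1].item():.3f}")
```

For retrieval, the same loop can rank a candidate set of captions per image (or images per caption) by this match probability.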
Frequently Asked Questions
Q: What makes this model unique?
BridgeTower's bridge-layer architecture enables effective bottom-up cross-modal alignment between visual and textual representations at different semantic levels, which lets it achieve superior performance with significantly less pre-training data than comparable models.
Q: What are the recommended use cases?
The model is ideal for vision-language tasks including image-text matching, visual question answering, and masked language modeling with visual context. It's particularly effective for applications requiring deep understanding of relationships between visual and textual content.
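For masked language modeling with visual context, a minimal sketch using `transformers`' `BridgeTowerForMaskedLM` might look like the following. The `<mask>` token matches the RoBERTa-style tokenizer BridgeTower uses; the image URL and caption are placeholders.

```python
import requests
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
text = "a <mask> looking out of the window"  # <mask> is the model's mask token

ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForMaskedLM.from_pretrained(ckpt)

encoding = processor(image, text, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits  # (1, seq_len, vocab_size)

# Greedily decode the highest-scoring token at each position; the image
# conditions the prediction for the masked word.
print(processor.decode(logits.argmax(dim=-1).squeeze(0).tolist()))
```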