LLM2CLIP-Openai-L-14-336
| Property | Value |
|---|---|
| Parameter Count | 579M |
| License | Apache 2.0 |
| Paper | arXiv:2411.04997 |
| Training Data | CC3M, CC12M, YFCC15M, Recap-DataComp-1B (30M subset) |
What is LLM2CLIP-Openai-L-14-336?
LLM2CLIP-Openai-L-14-336 is a vision-language model from Microsoft that extends CLIP's capabilities by using a large language model as its text encoder. According to the accompanying paper, this approach improves on the previous SOTA EVA02 model by 16.5% in both long-text and short-text retrieval tasks.
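A minimal inference sketch in the usual Hugging Face style is shown below. The repository id, the `trust_remote_code=True` requirement, and the `get_image_features` method follow the official model card, but treat them as assumptions and check the card for the matching LLM-based text pipeline (the text side uses the fine-tuned LLM, e.g. via LLM2Vec, rather than CLIP's original text tower):

```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor

# The image preprocessing matches the original OpenAI ViT-L/14-336 checkpoint.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

# Assumption: the checkpoint ships custom modeling code, so trust_remote_code is required.
model = AutoModel.from_pretrained(
    "microsoft/LLM2CLIP-Openai-L-14-336",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

image = Image.open("example.jpg")  # hypothetical local image file
pixels = processor(images=image, return_tensors="pt").pixel_values.to(torch.bfloat16)

with torch.no_grad():
    # Assumption: the custom model class exposes get_image_features, as CLIP does.
    image_emb = model.get_image_features(pixels)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Caption embeddings come from the fine-tuned LLM released alongside the model,
# not from CLIP's original text encoder; see the paper and repository for details.
```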
Implementation Details
The training recipe first fine-tunes an LLM in the caption space with contrastive learning; the fine-tuned LLM then serves as a teacher for CLIP's visual encoder. This lets the model process captions that are longer and more complex than vanilla CLIP's 77-token text encoder can handle (an illustrative sketch of the objective follows the list below).
- Weights published in F32 tensor format
- Supports the zero-shot classification pipeline
- Ships custom modeling code, so loading requires `trust_remote_code=True`
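To make the teacher-student idea above concrete, here is an illustrative sketch of a symmetric contrastive (InfoNCE) objective in which caption embeddings from the frozen, fine-tuned LLM supervise the trainable CLIP visual encoder. This is not the authors' training code; the function, its arguments, and the shared temperature `logit_scale` are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def llm2clip_style_loss(image_emb: torch.Tensor,
                        caption_emb: torch.Tensor,
                        logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive (InfoNCE) loss between a batch of image embeddings
    (from the trainable CLIP visual encoder) and caption embeddings (from the
    frozen, caption-contrastively fine-tuned LLM acting as teacher)."""
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    logits = logit_scale.exp() * image_emb @ caption_emb.t()  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matched image/caption pairs sit on the diagonal of the similarity matrix.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```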
Core Capabilities
- Enhanced cross-lingual performance despite English-only training
- Superior performance in multimodal tasks when integrated with models like LLaVA 1.5
- Improved textual discriminability through LLM integration
- Efficient processing of longer and more complex captions
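The long-caption retrieval capability reduces, at inference time, to a cosine-similarity ranking over embeddings produced by the pipeline sketched earlier. The helper below is a hypothetical illustration, not part of the released API:

```python
import torch

def retrieve_images(caption_emb: torch.Tensor,
                    image_embs: torch.Tensor,
                    top_k: int = 5) -> torch.Tensor:
    """Rank a gallery of image embeddings against one (possibly long) caption
    embedding by cosine similarity and return the indices of the top-k matches."""
    caption_emb = caption_emb / caption_emb.norm(dim=-1, keepdim=True)
    image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
    sims = image_embs @ caption_emb.squeeze(0)  # (N,) cosine similarities
    return sims.topk(top_k).indices
```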
Frequently Asked Questions
Q: What makes this model unique?
A: Its use of a fine-tuned LLM as a teacher for CLIP's visual encoder allows it to handle longer and more complex text descriptions while maintaining high performance in cross-modal tasks.
Q: What are the recommended use cases?
A: The model excels at image-text retrieval, zero-shot classification, and cross-lingual applications. It is particularly useful where an application needs a sophisticated understanding of both visual and textual content.
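Zero-shot classification can be framed the same way: embed a prompt such as "a photo of a {class name}" for each class with the LLM text encoder, then pick the class whose prompt embedding is closest to the image embedding. The helper name and prompt template here are hypothetical:

```python
import torch

def zero_shot_classify(image_emb: torch.Tensor,
                       class_prompt_embs: torch.Tensor,
                       class_names: list[str]) -> str:
    """Classify one image by comparing its embedding against embeddings of
    class-name prompts (one row per class) produced by the text encoder."""
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    class_prompt_embs = class_prompt_embs / class_prompt_embs.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ class_prompt_embs.t()).softmax(dim=-1)  # (1, C)
    return class_names[probs.argmax().item()]
```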