ColSmolVLM-alpha
| Property | Value |
|---|---|
| Base Model | vidore/ColSmolVLM-base |
| License | Apache 2.0 (backbone) / MIT (adapters) |
| Paper | ColPali: Efficient Document Retrieval with Vision Language Models |
| Training Data | 127,460 query-page pairs |
What is ColSmolVLM-alpha?
ColSmolVLM-alpha is a vision language model designed for efficient document retrieval. It extends SmolVLM to produce ColBERT-style multi-vector representations of both text and images. This version was trained for 3 epochs with a batch size of 128, using parameter-efficient fine-tuning (PEFT) with LoRA adapters.
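To make the idea of ColBERT-style multi-vector matching concrete, here is a minimal pure-Python sketch of late-interaction (MaxSim) scoring. It is an illustration under stated assumptions, not the model's actual implementation: the function names `dot` and `maxsim_score` and the toy 2-dimensional embeddings are hypothetical, and real embeddings would come from the model and have many more dimensions.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], page_vecs: List[Vector]) -> float:
    """Late-interaction score: for each query vector, keep its best match
    among the page vectors, then sum those maxima over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy embeddings, made up for illustration (one vector per token or patch).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [-1.0, 0.0]]

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)  # page with the highest score
```

Because each query vector is matched independently, a page scores highly only when it contains a good match for every part of the query, which is what distinguishes this scheme from comparing two single pooled vectors.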
Implementation Details
The model uses the bfloat16 format and LoRA adapters with alpha=32 and r=32 on the transformer layers. It is trained with the paged_adamw_8bit optimizer on 4 GPUs with data parallelism, using a learning rate of 5e-4 with linear decay and 2.5% warmup steps.
- Trained on 127,460 query-page pairs (63% academic datasets, 37% synthetic data)
- Uses FlashAttention-2 for efficient attention computation
- Implements ColBERT late interaction mechanism
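The learning-rate schedule described above (peak 5e-4, 2.5% warmup, linear decay) can be sketched in a few lines of Python. This is a rough sketch, not the training code: the helper `lr_at_step` is hypothetical, the decay is assumed to end at zero, and the step count assumes that 128 is the total batch size and that all 127,460 pairs are seen in each of the 3 epochs.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-4,
               warmup_fraction: float = 0.025) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warmup phase: scale the rate up proportionally to the step.
        return peak_lr * step / warmup_steps
    # Decay phase: scale the rate down with the fraction of steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Assumption: total batch size 128, full dataset used every epoch.
steps_per_epoch = math.ceil(127_460 / 128)  # 996
total_steps = steps_per_epoch * 3           # 2988
warmup_steps = int(total_steps * 0.025)     # 74

schedule = [lr_at_step(s, total_steps) for s in range(total_steps + 1)]
```

Under these assumptions the rate climbs for about 74 steps, peaks at 5e-4, and then falls linearly for the remaining steps. The shape matches what `get_linear_schedule_with_warmup` in the Hugging Face `transformers` library produces, although the source does not say which scheduler implementation was used.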
Core Capabilities
- Multi-vector representation generation for both text and images
- Efficient document indexing from visual features
- Zero-shot generalization to non-English languages
- PDF document processing and retrieval
Frequently Asked Questions
Q: What makes this model unique?
The model combines the efficiency of SmolVLM with ColBERT's multi-vector representation strategy. Queries and pages are compared through late interaction over many vectors rather than as single pooled embeddings, which allows finer-grained matching while keeping computational cost low.
Q: What are the recommended use cases?
The model is particularly suited to PDF document retrieval, especially in academic and professional contexts where precise document matching is crucial. It is designed to handle both the text and the visual elements of a page.