ColSmolVLM-v0.1
Property | Value |
---|---|
License | Apache 2.0 (backbone) / MIT (adapters) |
Paper | ColPali: Efficient Document Retrieval with Vision Language Models |
Training Data | 127,460 query-page pairs |
Architecture | SmolVLM with ColBERT strategy |
What is ColSmolVLM-v0.1?
ColSmolVLM is a vision language model designed for efficient document retrieval. It pairs SmolVLM with ColBERT-style multi-vector representations: each document page is embedded as a set of vectors rather than a single one, so pages can be indexed and retrieved based on their visual features. This version was trained for 3 epochs with a batch size of 128, using colpali-engine v0.3.5.
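To make the ColBERT-style scoring concrete, here is a minimal sketch of late-interaction (MaxSim) scoring in plain Python. It assumes the query and the page have already been embedded into lists of same-dimension vectors; the function names and the toy 2-dimensional vectors are illustrative only and are not part of colpali-engine.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors: Sequence[Vector], page_vectors: Sequence[Vector]) -> float:
    """Late-interaction score between one query and one page.

    For every query vector, keep its best dot product against any page
    vector, then sum those maxima over the query vectors.
    """
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


# Toy embeddings (hypothetical values, 2 dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.2], [0.3, 0.1]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print("page_a" if score_a > score_b else "page_b")
```

Because the maximum is taken per query vector, a page scores well when each part of the query finds a good match somewhere on the page, even if no single page vector matches the whole query.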
Implementation Details
The model was trained in bfloat16 with low-rank adapters (LoRA, alpha=32 and r=32) and the paged_adamw_8bit optimizer. Training ran on 4 GPUs with data parallelism, using a learning rate of 5e-4 with linear decay and 2.5% warmup steps.
Key features:
- Multi-vector representation capability
- Flash Attention 2 support
- Efficient document indexing and retrieval
- Zero-shot generalization to non-English languages
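The training settings given under Implementation Details can be turned into numbers with a short sketch. The values in the dataclass come from this page; everything else is an assumption made for illustration: that 128 is the global batch size, that the last partial batch of each epoch is kept, that warmup rises linearly from zero, and that decay ends at zero. The names `TrainingSetup` and `learning_rate` are hypothetical and do not come from colpali-engine.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    # Values stated on this page.
    num_pairs: int = 127_460
    batch_size: int = 128
    epochs: int = 3
    peak_lr: float = 5e-4
    warmup_fraction: float = 0.025
    lora_r: int = 32
    lora_alpha: int = 32

    @property
    def total_steps(self) -> int:
        # Assumption: global batch size, last partial batch kept.
        return math.ceil(self.num_pairs / self.batch_size) * self.epochs

    @property
    def warmup_steps(self) -> int:
        return round(self.total_steps * self.warmup_fraction)

    @property
    def lora_scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.lora_r


def learning_rate(step: int, setup: TrainingSetup) -> float:
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    if step < setup.warmup_steps:
        return setup.peak_lr * step / setup.warmup_steps
    remaining = max(0, setup.total_steps - step)
    decay_span = setup.total_steps - setup.warmup_steps
    return setup.peak_lr * remaining / decay_span


setup = TrainingSetup()
print(setup.total_steps, setup.warmup_steps, setup.lora_scaling)
```

With alpha equal to r, the LoRA scaling factor is 1, so the low-rank update is added to the frozen weights without rescaling.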
Core Capabilities
- PDF document analysis and retrieval
- Visual feature extraction and indexing
- Cross-modal understanding between text and images
- Efficient batch processing of queries and images
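The indexing and batch retrieval capabilities listed above can be pictured as a two-phase workflow: page embeddings are computed once and stored, then each incoming query is scored against the stored pages. The sketch below is a hypothetical in-memory index in plain Python, not the colpali-engine API. It assumes embeddings are already available as lists of vectors and uses toy 2-dimensional values.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
MultiVector = Sequence[Vector]


class PageIndex:
    """Hypothetical in-memory store of per-page multi-vector embeddings."""

    def __init__(self) -> None:
        self._pages: Dict[str, MultiVector] = {}

    def add(self, page_id: str, vectors: MultiVector) -> None:
        # Indexing phase: embeddings are stored once and reused for every query.
        self._pages[page_id] = vectors

    @staticmethod
    def _score(query: MultiVector, page: MultiVector) -> float:
        # Sum over query vectors of the best dot product against the page.
        return sum(
            max(sum(x * y for x, y in zip(q, p)) for p in page) for q in query
        )

    def search(self, query: MultiVector, k: int = 3) -> List[Tuple[str, float]]:
        # Retrieval phase: score every stored page and keep the k best.
        scored = ((self._score(query, vecs), pid) for pid, vecs in self._pages.items())
        return [(pid, score) for score, pid in heapq.nlargest(k, scored)]

    def search_batch(self, queries: Sequence[MultiVector], k: int = 3):
        # One ranked list per query, in input order.
        return [self.search(q, k) for q in queries]


index = PageIndex()
index.add("invoice", [[1.0, 0.0], [0.8, 0.2]])
index.add("chart", [[0.0, 1.0], [0.1, 0.9]])
index.add("mixed", [[0.6, 0.6]])

queries = [[[1.0, 0.0]], [[0.0, 1.0]]]
results = index.search_batch(queries, k=2)
print([hits[0][0] for hits in results])
```

A real deployment would score pages with batched tensor operations on a GPU instead of Python loops; the loop form is used here only to keep the data flow visible.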
Frequently Asked Questions
Q: What makes this model unique?
The model combines SmolVLM's vision capabilities with ColBERT's retrieval strategy to retrieve documents based on their visual features. Because each page is represented by multiple vectors rather than a single one, a query can be matched against individual parts of a page, which allows finer-grained retrieval.
Q: What are the recommended use cases?
The model is well suited to PDF document retrieval, academic research, and document analysis tasks. It works best where documents must be indexed and retrieved based on both visual and textual content, particularly for high-resource languages.