ColSmolVLM-v0.1

Property	Value
License	Apache 2.0 (backbone) / MIT (adapters)
Paper	ColPali: Efficient Document Retrieval with Vision Language Models
Training Data	127,460 query-page pairs
Architecture	SmolVLM with ColBERT strategy

What is ColSmolVLM-v0.1?

ColSmolVLM is a vision language model designed for efficient document retrieval. It combines SmolVLM with ColBERT-style multi-vector representations, so that document pages can be indexed and retrieved from their visual features. This version was trained for 3 epochs with a batch size of 128, using colpali-engine v0.3.5.
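
For reference, ColBERT-style late interaction, as used in the ColPali paper, scores a query against a page by matching each query token vector to its most similar page patch vector and summing the results. The symbols below (q_i, d_j, N_q, N_d) are notation introduced here for illustration, not taken from this page:

```latex
\[
  s(q, d) \;=\; \sum_{i=1}^{N_q} \; \max_{1 \le j \le N_d} \; \langle \mathbf{q}_i, \mathbf{d}_j \rangle
\]
```

Here q_i is the i-th of N_q query token embeddings and d_j is the j-th of N_d page patch embeddings.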

Implementation Details

The model was trained in bfloat16 with low-rank adapters (LoRA, alpha=32, r=32) and the paged_adamw_8bit optimizer. Training ran on 4 GPUs with data parallelism, using a learning rate of 5e-4 with linear decay and warmup over the first 2.5% of steps.
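
The learning-rate schedule described above can be sketched as follows. This is a minimal illustration, not the actual training code: the function name, the rounding of the warmup length, and the assumption that the rate starts at 0 and decays to 0 are all choices made here, mirroring the common linear-warmup, linear-decay pattern.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=5e-4, warmup_frac=0.025):
    """Learning rate at a 0-indexed optimizer step.

    Hypothetical helper: linear warmup over the first `warmup_frac` of steps,
    then linear decay to zero at `total_steps`.
    """
    # Assumption: the warmup length is rounded down to a whole number of steps.
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay linearly from the peak down to 0 at the final step.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)
```

With `total_steps=1000`, the rate climbs for the first 25 steps, peaks at 5e-4, and then falls linearly to zero.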

Multi-vector representation capability
Flash Attention 2 support
Efficient document indexing and retrieval
Zero-shot generalization to non-English languages

Core Capabilities

PDF document analysis and retrieval
Visual feature extraction and indexing
Cross-modal understanding between text and images
Efficient batch processing of queries and images
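
A minimal sketch of batch scoring with colpali-engine might look like the following. The class names `ColIdefics3` and `ColIdefics3Processor`, the repository id `vidore/colsmolvlm-v0.1`, and the default device are assumptions not stated on this page, so check them against the installed colpali-engine version before relying on them. The wrapper function `embed_and_score` is hypothetical.

```python
def embed_and_score(images, queries, model_name="vidore/colsmolvlm-v0.1", device="cuda:0"):
    """Return a (num_queries x num_images) score matrix.

    `images` is a list of PIL images (document pages), `queries` a list of strings.
    The model name and device defaults are assumptions for illustration.
    """
    # Heavy dependencies are imported lazily so that this module can be
    # loaded even when torch and colpali-engine are not installed.
    import torch
    from colpali_engine.models import ColIdefics3, ColIdefics3Processor

    model = ColIdefics3.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map=device,
        # Assumes flash-attn is installed; use "eager" otherwise.
        attn_implementation="flash_attention_2",
    ).eval()
    processor = ColIdefics3Processor.from_pretrained(model_name)

    # Preprocess both modalities into batches on the model's device.
    batch_images = processor.process_images(images).to(model.device)
    batch_queries = processor.process_queries(queries).to(model.device)

    # Inference only: no gradients are needed to produce embeddings.
    with torch.no_grad():
        image_embeddings = model(**batch_images)
        query_embeddings = model(**batch_queries)

    # Late-interaction scoring over the multi-vector embeddings.
    return processor.score_multi_vector(query_embeddings, image_embeddings)
```

For a large collection, page embeddings would normally be computed once and stored, so that only the queries need to be embedded at search time.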

Frequently Asked Questions

Q: What makes this model unique?

The model combines SmolVLM's vision capabilities with ColBERT's late-interaction retrieval strategy to retrieve documents from their visual features. Because each query and each page is represented by multiple vectors rather than a single one, matching can happen at the level of individual query tokens and image patches.
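
To make the multi-vector retrieval described above concrete, here is a small pure-Python sketch of how pages could be ranked for a query. The function names and the toy 2-dimensional embeddings are made up for illustration; real embeddings come from the model and are far larger.

```python
def maxsim_score(query_vecs, page_vecs):
    """For each query vector, take its best dot-product match among the
    page vectors, then sum those maxima over all query vectors."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in page_vecs)
        for q_vec in query_vecs
    )


def rank_pages(query_vecs, pages):
    """Return page ids sorted by descending score.

    `pages` maps a page id to its list of embedding vectors.
    """
    scores = {pid: maxsim_score(query_vecs, vecs) for pid, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.2, 0.1]],
}
ranking = rank_pages(query, pages)  # page_a scores 2.0, page_b scores 1.0
```

In this toy case `page_a` ranks first because it contains an exact match for each query vector, whereas `page_b` only partially matches both.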

Q: What are the recommended use cases?

The model is well suited to PDF document retrieval, academic research, and document analysis tasks. It fits scenarios that require efficient indexing and retrieval of documents based on both visual and textual content, especially in high-resource languages.
