ColSmolVLM-alpha

Property	Value
Base Model	vidore/ColSmolVLM-base
License	Apache 2.0 (backbone) / MIT (adapters)
Paper	ColPali: Efficient Document Retrieval with Vision Language Models
Training Data	127,460 query-page pairs

What is ColSmolVLM-alpha?

ColSmolVLM-alpha is a vision language model for document retrieval. It extends SmolVLM with ColBERT-style multi-vector representations for both text and images, so each query and each page image is encoded as a set of vectors rather than a single pooled vector. This version was trained for 3 epochs with a batch size of 128, using parameter-efficient fine-tuning (PEFT) with LoRA adapters.

Implementation Details

The model is trained in bfloat16, with LoRA adapters (alpha=32, r=32) applied to the transformer layers. Training uses the paged_adamw_8bit optimizer on a 4-GPU setup with data parallelism, and a learning rate of 5e-4 with linear decay and 2.5% warmup steps.
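
To make the learning rate schedule concrete, the following Python sketch computes the rate at a given step under linear warmup followed by linear decay. It is an illustration, not the actual training code: the function name linear_schedule is hypothetical, and the sketch assumes that the batch size of 128 is the global batch, that the final partial batch of each epoch is kept, and that the warmup step count is rounded up.

```python
import math


def linear_schedule(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warmup to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Values taken from the description above.
NUM_PAIRS = 127_460
BATCH_SIZE = 128        # assumed to be the global batch size
EPOCHS = 3
PEAK_LR = 5e-4
WARMUP_FRACTION = 0.025

# Assumption: the last, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(NUM_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)

for step in (0, warmup_steps, total_steps // 2, total_steps):
    lr = linear_schedule(step, total_steps, warmup_steps, PEAK_LR)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

Under these assumptions the run has 996 steps per epoch and 2,988 steps in total, of which the first 75 are warmup. A different batch accounting (for example, 128 per GPU) would change those counts but not the shape of the schedule.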

Trained on 127,460 query-page pairs (63% academic datasets, 37% synthetic data)
Uses FlashAttention-2 for faster, more memory-efficient attention
Implements the ColBERT late-interaction mechanism
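
The late-interaction scoring rule can be sketched in a few lines of plain Python. For each query vector, take its maximum dot product over all page vectors, then sum those maxima (often called MaxSim). The helper names dot and maxsim_score and the tiny 2-dimensional vectors are illustrative assumptions. Real embeddings come from the model and are much larger, and a practical implementation would use batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """ColBERT-style late interaction: sum over query vectors of the best page match."""
    return sum(
        max(dot(q, p) for p in page_vectors)
        for q in query_vectors
    )


# Toy data (hand-written, not model output): one vector per query token,
# one vector per page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[1.0, 0.0], [0.5, 0.5]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a, score_b)  # 2.0 1.5
```

Page A scores higher because each query vector finds an exact match among its patches, while the second query vector only partially matches anything on page B.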

Core Capabilities

Multi-vector representation generation for both text and images
Efficient document indexing from visual features
Zero-shot generalization to non-English languages
PDF document processing and retrieval
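
As a rough sketch of how indexing and retrieval fit together, the example below stores one multi-vector embedding per page ahead of time and ranks the stored pages when a query arrives. The PageIndex class, the page identifiers, and the toy vectors are all hypothetical. In practice the vectors would be produced by the model from page images and query text, and a real index would not scan every page in a Python loop.

```python
import heapq


class PageIndex:
    """Toy in-memory index mapping a page id to its list of embedding vectors."""

    def __init__(self):
        self._pages = {}

    def add(self, page_id, vectors):
        # Offline step: store the multi-vector embedding of one page image.
        self._pages[page_id] = vectors

    def search(self, query_vectors, k=3):
        # Online step: score every stored page with late interaction, keep the top k.
        def score(page_vectors):
            return sum(
                max(sum(a * b for a, b in zip(q, p)) for p in page_vectors)
                for q in query_vectors
            )

        scored = ((score(vecs), page_id) for page_id, vecs in self._pages.items())
        return [(page_id, s) for s, page_id in heapq.nlargest(k, scored)]


index = PageIndex()
index.add("report.pdf#1", [[1.0, 0.0], [0.0, 1.0]])
index.add("report.pdf#2", [[0.5, 0.5]])
index.add("slides.pdf#1", [[0.0, 0.5]])

results = index.search([[1.0, 0.0], [0.0, 1.0]], k=2)
print(results)  # [('report.pdf#1', 2.0), ('report.pdf#2', 1.0)]
```

Because page embeddings are computed once and stored, only the query needs to be encoded at search time.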

Frequently Asked Questions

Q: What makes this model unique?

The model pairs the compact SmolVLM backbone with ColBERT's multi-vector representation strategy. Matching individual query tokens against individual page patches allows finer-grained retrieval than comparing single pooled embeddings, while the small backbone keeps compute costs low.

Q: What are the recommended use cases?

The model is well suited to PDF document retrieval, especially in academic and professional settings where precise document matching matters. It is designed to handle both the textual and the visual elements of a page.