ColSmolVLM-alpha

Maintained by: vidore

Base Model: vidore/ColSmolVLM-base
License: Apache 2.0 (backbone) / MIT (adapters)
Paper: ColPali: Efficient Document Retrieval with Vision Language Models
Training Data: 127,460 query-page pairs

What is ColSmolVLM-alpha?

ColSmolVLM-alpha is a vision-language model designed for efficient document retrieval. It extends SmolVLM with ColBERT-style multi-vector representations for both text and images. This version was trained with a batch size of 128 for 3 epochs, using parameter-efficient fine-tuning (PEFT) with LoRA adapters.

Implementation Details

The model uses bfloat16 precision and LoRA adapters with alpha=32 and r=32 on the transformer layers. Training used the paged_adamw_8bit optimizer on a 4-GPU data-parallel setup, with a learning rate of 5e-4, linear decay, and 2.5% warmup steps.
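
As a rough illustration, the hyperparameters above can be expressed with Hugging Face's peft and transformers libraries. This is a minimal sketch under stated assumptions, not the actual training script: the target_modules selection, the per-device batch split, the dropout value, and the task type are not stated in the card.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings described above: r=32, alpha=32 on transformer layers.
# The target_modules list is a hypothetical selection for illustration.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,                 # assumed; not reported in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="FEATURE_EXTRACTION",
)

# Optimizer and schedule described above: paged 8-bit AdamW, lr 5e-4,
# linear decay with 2.5% warmup, bfloat16 precision.
training_args = TrainingArguments(
    output_dir="./colsmolvlm-alpha-lora",
    num_train_epochs=3,
    per_device_train_batch_size=32,   # assumed split: 4 GPUs x 32 = global batch size 128
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,
    optim="paged_adamw_8bit",
    bf16=True,
)
```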

  • Trained on 127,460 query-page pairs (63% academic datasets, 37% synthetic data)
  • Uses Flash Attention 2 for efficient processing
  • Implements the ColBERT late-interaction mechanism (sketched below)
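
The late interaction noted in the last bullet scores a query against a page by taking, for each query-token embedding, its maximum similarity over all page-patch embeddings and summing those maxima (MaxSim). A minimal PyTorch sketch, with illustrative shapes and dimensions:

```python
import torch

def late_interaction_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim scoring.

    query_emb: (num_query_tokens, dim) multi-vector query representation
    doc_emb:   (num_doc_tokens, dim)   multi-vector page representation
    Returns a single relevance score.
    """
    # Similarity between every query token and every document token/patch.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token (MaxSim) ...
    max_sim = sim.max(dim=1).values          # (num_query_tokens,)
    # ... then sum over query tokens to get the final score.
    return max_sim.sum()

# Toy example with random embeddings of dimension 128.
q = torch.randn(20, 128)
d = torch.randn(1024, 128)
print(late_interaction_score(q, d))
```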

Core Capabilities

  • Multi-vector representation generation for both text and images
  • Efficient document indexing from visual features
  • Zero-shot generalization to non-English languages
  • PDF document processing and retrieval
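
For a concrete sense of how the capabilities above are exercised, here is a hedged usage sketch based on the colpali-engine package commonly used with ColPali-family checkpoints. The class names ColIdefics3 / ColIdefics3Processor, the process_images / process_queries helpers, and score_multi_vector are assumptions about that package's API and may differ between versions; the image file names are placeholders.

```python
import torch
from PIL import Image
from colpali_engine.models import ColIdefics3, ColIdefics3Processor  # class names assumed

model_name = "vidore/colsmolvlm-alpha"

model = ColIdefics3.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",  # matches the Flash Attention 2 note above
).eval()
processor = ColIdefics3Processor.from_pretrained(model_name)

# Embed document page images and text queries as multi-vector representations.
images = [Image.open("page_1.png"), Image.open("page_2.png")]   # placeholder files
queries = ["What is the reported training batch size?"]

batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction scoring between each query and each page (helper name assumed).
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
print(scores)
```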

Frequently Asked Questions

Q: What makes this model unique?

The model combines the efficiency of SmolVLM with ColBERT's multi-vector representation strategy, enabling more nuanced document retrieval capabilities while maintaining computational efficiency.

Q: What are the recommended use cases?

The model is particularly suited for PDF document retrieval tasks, especially in academic and professional contexts where precise document matching is crucial. It's designed to handle both text and visual elements effectively.
