colpali-v1.2-hf

vidore

ColPali is a visual retrieval model based on PaliGemma-3B. It applies the ColBERT late-interaction strategy to document indexing, pairing a SigLIP vision encoder with a language model.

Property	Value
License	Gemma license (backbone) + MIT (adapters)
Paper	arXiv:2407.01449
Training Data	127,460 query-page pairs
Architecture	PaliGemma-3B with ColBERT strategy

What is colpali-v1.2-hf?

ColPali is a Vision Language Model (VLM) designed for document retrieval. Built on PaliGemma-3B, it generates ColBERT-style multi-vector representations of both text and images, so each page is indexed as a set of embeddings rather than a single vector. The model pairs SigLIP's visual encoding with a language model so that text queries can be matched directly against page images.

Implementation Details

The model passes image patch embeddings through a language model, which places textual and visual content in a shared latent space. It was fine-tuned with LoRA adapters (alpha=32, r=32) on the transformer layers, in bfloat16 precision, using the paged_adamw_8bit optimizer.
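To make the alpha and r settings concrete, here is a minimal sketch of the standard LoRA formulation, in which a frozen weight W is adjusted by a low-rank product scaled by alpha / r. It assumes the conventional alpha / r scaling (the source does not say which variant the training code used), and the helper names and toy matrices are hypothetical, not the model's real dimensions.

```python
# Illustrative sketch of the standard LoRA update: W_eff = W + scale * (B @ A).
# Assumption: scale = alpha / r, the conventional LoRA scaling.
# All matrices below are toy values chosen for readability.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, scale):
    """Return W + scale * (B @ A), the weight the layer effectively applies."""
    delta = matmul(b, a)  # shape: (out_dim, in_dim)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


LORA_ALPHA = 32  # from the model card
LORA_R = 32      # from the model card
lora_scale = LORA_ALPHA / LORA_R  # equal alpha and r give a scale of 1.0

# Toy rank-1 adapter on a 2x2 frozen weight.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
adapter_a = [[1.0, 0.0]]    # shape: (rank, in_dim)
adapter_b = [[0.0], [2.0]]  # shape: (out_dim, rank)

effective_w = lora_effective_weight(frozen_w, adapter_a, adapter_b, lora_scale)
```

Under this convention, setting alpha equal to r means the adapter output is added to the frozen weights without extra rescaling.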

Training conducted on 8 GPUs with data parallelism
Learning rate of 5e-5 with linear decay
2.5% warmup steps and batch size of 32
Trained on English data only; multilingual capability is zero-shot
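The learning-rate settings listed above can be sketched as a simple schedule function. This is an illustration under stated assumptions: the warmup is taken to be linear (the source only gives its length), and total_steps is a hypothetical value, since the source does not report the number of optimizer steps.

```python
# Sketch of a linear-warmup, linear-decay schedule.
# From the model card: peak learning rate 5e-5, 2.5% warmup, linear decay.
# Assumptions: warmup is linear, and total_steps is a made-up example value.

def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.025):
    """Return the learning rate at a given optimizer step."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp up linearly from 0 to peak_lr during warmup.
        return peak_lr * step / warmup_steps
    # Decay linearly from peak_lr down to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


TOTAL_STEPS = 1000  # hypothetical; 2.5% of this gives 25 warmup steps

schedule = [learning_rate(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

With these example values the rate climbs for the first 25 steps, peaks at 5e-5, and then falls linearly to zero at the final step.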

Core Capabilities

Efficient document indexing from visual features
Multi-vector representation generation
Cross-modal retrieval between text and images
Zero-shot generalization to non-English languages
PDF document processing and analysis

Frequently Asked Questions

Q: What makes this model unique?

ColPali maps image patch embeddings into a latent space similar to that of textual input, which enables ColBERT-style late interaction between query text tokens and image patches. The paper (arXiv:2407.01449) reports that this improves retrieval performance over pipelines that first extract text from the document.
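The ColBERT-style interaction can be illustrated with a small scoring function: each query token vector is matched to its best-scoring page patch vector, and those maxima are summed. This is a pure-Python sketch with toy 2-dimensional vectors; the function names and data are hypothetical, and a real setup would take the embeddings from the model and use batched tensor operations.

```python
# Sketch of ColBERT-style late interaction (MaxSim) scoring.
# Toy vectors only; real embeddings come from the model.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, page_vectors):
    """Sum, over query tokens, of the best dot product against any page patch."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def rank_pages(query_vectors, pages):
    """Return (page_name, score) pairs sorted from best to worst match."""
    scores = {name: maxsim_score(query_vectors, vecs) for name, vecs in pages.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Two query token vectors and two candidate pages, each with two patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0]],  # has a matching patch for both tokens
    "page_b": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first token
}

ranking = rank_pages(query, pages)
```

Here page_a ranks first because every query token finds a closely matching patch, while page_b covers only one of the two tokens.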

Q: What are the recommended use cases?

The model is well suited to document retrieval tasks, especially those involving PDF documents. It fits scenarios that require matching text queries against the visual content of document pages, such as digital library systems, document management, and academic research.