colpali-v1.2-hf

vidore

ColPali is a visual retrieval model built on PaliGemma-3B. It applies the ColBERT late-interaction strategy to index documents directly from page images, combining the SigLIP vision encoder with a language model.

Property        Value
License         Gemma license (backbone) + MIT (adapters)
Paper           arXiv:2407.01449
Training Data   127,460 query-page pairs
Architecture    PaliGemma-3B with ColBERT strategy

What is colpali-v1.2-hf?

ColPali is a Vision Language Model (VLM) designed for document retrieval. Built on PaliGemma-3B, it produces ColBERT-style multi-vector representations of both text and images, so documents can be indexed and retrieved from their visual appearance. The model pairs SigLIP's visual encoding with a language model to embed queries and pages for retrieval.

Implementation Details

The model passes image patch embeddings through the language model, so textual and visual content share a single latent space. Training used LoRA adapters (alpha=32, r=32) on the transformer layers, bfloat16 precision, and the paged_adamw_8bit optimizer.
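As a rough illustration of what the LoRA settings above mean, the sketch below applies a frozen linear layer plus a low-rank correction scaled by alpha / r, following the standard LoRA formulation. With alpha=32 and r=32 the scaling factor is 1. The function name `lora_forward` and the layer sizes are assumptions chosen for the example; this is not the model's training code.

```python
import numpy as np

# Hyperparameters from the text: rank r = 32 and alpha = 32.
R, ALPHA = 32, 32
SCALING = ALPHA / R  # equals 1.0 for these values


def lora_forward(x, w_frozen, lora_a, lora_b, scaling=SCALING):
    """Frozen linear layer plus its low-rank LoRA correction.

    x:        (batch, d_in) inputs
    w_frozen: (d_out, d_in) pretrained weight, kept frozen
    lora_a:   (r, d_in) trainable down-projection
    lora_b:   (d_out, r) trainable up-projection
    """
    base = x @ w_frozen.T
    update = (x @ lora_a.T) @ lora_b.T
    return base + scaling * update


rng = np.random.default_rng(0)
d_in, d_out = 64, 48  # toy sizes (assumption), not the real model dimensions
x = rng.normal(size=(4, d_in))
w = rng.normal(size=(d_out, d_in))
a = rng.normal(size=(R, d_in)) * 0.01
b = np.zeros((d_out, R))  # B starts at zero, so the adapter is a no-op at first

y = lora_forward(x, w, a, b)
```

Only `lora_a` and `lora_b` would be trained in such a setup, which is why the adapters can be distributed separately from the backbone weights.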

  • Trained on 8 GPUs with data parallelism
  • Learning rate of 5e-5 with linear decay
  • Warmup over 2.5% of training steps, batch size of 32
  • Trained on an English dataset; retrieval in other languages is zero-shot
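The learning-rate settings listed above can be sketched as a small schedule function. This is a minimal sketch that assumes a linear warmup from zero (the source states only the warmup fraction and the linear decay), and the total of 4000 steps is an illustrative number, not a figure from the model card.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.025):
    """Linear warmup to peak_lr, then linear decay to zero.

    peak_lr and warmup_fraction come from the text; the linear shape of
    the warmup is an assumption.
    """
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


TOTAL_STEPS = 4000  # hypothetical run length, for illustration only
for step in (0, 50, 100, 2050, 4000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

With these numbers the warmup covers the first 100 steps, after which the rate falls linearly until the final step.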

Core Capabilities

  • Efficient document indexing from visual features
  • Multi-vector representation generation
  • Cross-modal retrieval between text and images
  • Zero-shot generalization to non-English languages
  • PDF document processing and analysis

Frequently Asked Questions

Q: What makes this model unique?

ColPali maps image patch embeddings into a latent space similar to that of textual input, which enables ColBERT-style interactions between query text tokens and image patches. The authors report that this improves retrieval performance over conventional retrieval pipelines (see arXiv:2407.01449).
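The ColBERT-style interaction described above can be sketched as follows: each query token takes its best-matching image patch, and those maxima are summed into a page score. This is a minimal sketch with toy 2-dimensional vectors; the function names `maxsim_score` and `rank_pages` are hypothetical, and real embeddings would come from the model rather than being written by hand.

```python
import numpy as np


def maxsim_score(query_emb, page_emb):
    """Late-interaction score between one query and one page.

    query_emb: (n_query_tokens, dim) query token embeddings
    page_emb:  (n_patches, dim) image patch embeddings
    """
    sim = query_emb @ page_emb.T  # (n_query_tokens, n_patches) dot products
    # Best patch per query token, summed over query tokens.
    return float(sim.max(axis=1).sum())


def rank_pages(query_emb, page_embs):
    """Return page indices sorted by descending score, plus the raw scores."""
    scores = [maxsim_score(query_emb, p) for p in page_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy data (assumption): two query tokens and two pages with different patch counts.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
page_b = np.array([[1.0, 0.0], [1.0, 0.0]])

order, scores = rank_pages(query, [page_a, page_b])
print(order, scores)
```

Because pages are scored from precomputed patch embeddings, the page side can be indexed once and only the query needs to be embedded at search time.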

Q: What are the recommended use cases?

The model is suited to document retrieval tasks, especially those involving PDF documents. It fits scenarios that match text queries against visual document content, such as digital library systems, document management, and academic research.
