ColPali

Property	Value
Base Model	google/paligemma-3b-mix-448
License	MIT
Paper	ColPali: Efficient Document Retrieval with Vision Language Models
Primary Language	English

What is ColPali?

ColPali is a visual document retrieval model that extends PaliGemma-3B, a vision-language model, with ColBERT's multi-vector representation strategy. It builds on the SigLIP vision encoder, with BiSigLIP as an intermediate fine-tuned stage, and it indexes and retrieves documents directly from their visual features rather than from extracted text.

Implementation Details

The model passes image patch embeddings through a language model, so that text and image inputs are mapped into a single shared latent space. It was trained on 127,460 query-page pairs drawn from academic datasets and synthetic data, using LoRA adapters with alpha=32 and r=32.
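To make the LoRA settings concrete, here is a minimal pure-Python sketch of a low-rank update. It assumes the standard LoRA form, W + (alpha / r) * (B @ A); it is not ColPali's training code. The helper names and toy matrices are hypothetical, chosen only for illustration.

```python
# Minimal LoRA sketch (illustrative only, not ColPali's actual training code).
# Assumption: the adapted weight is W + (alpha / r) * (B @ A), the standard LoRA form.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_scaling(alpha, r):
    """Scaling factor applied to the low-rank update."""
    return alpha / r

def apply_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * (B @ A) for small list-of-lists matrices."""
    scale = lora_scaling(alpha, r)
    delta = matmul(b, a)  # shape (out_features, in_features)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# With alpha=32 and r=32, the update is applied with a scaling factor of 1.0.
SCALE = lora_scaling(alpha=32, r=32)

# Toy rank-1 example with hypothetical numbers, kept tiny for readability.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, in_features) = (1, 2)
B = [[0.5], [0.0]]  # shape (out_features, r) = (2, 1)
W_adapted = apply_lora(W, A, B, alpha=1, r=1)
```

Because alpha equals r here, the low-rank product is added to the frozen weights without extra rescaling. Only A and B are trained, which keeps the number of trainable parameters small.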

Trained in bfloat16 format on an 8-GPU setup
Uses the paged_adamw_8bit optimizer with a learning rate of 5e-5
Applies linear learning-rate decay with 2.5% warmup steps
Uses a batch size of 32
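The learning-rate settings listed above can be sketched as a small schedule function. This is a minimal illustration: it assumes the rate ramps linearly from zero during warmup and then decays linearly to zero. The function name and the total step count of 1000 are hypothetical, since the source does not give the number of training steps.

```python
# Sketch of a linear-decay schedule with warmup (assumed shape, not the
# authors' exact scheduler). Peak rate 5e-5 and 2.5% warmup come from the
# list above; total_steps is a hypothetical value.

def linear_schedule_lr(step, total_steps, peak_lr=5e-5, warmup_fraction=0.025):
    """Return the learning rate at a given optimizer step."""
    warmup_steps = round(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / max(1, warmup_steps)
    # Linear decay from peak_lr down to 0 over the remaining steps.
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

TOTAL_STEPS = 1000  # hypothetical
schedule = [linear_schedule_lr(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

With 1000 total steps, the warmup phase covers the first 25 steps, after which the rate falls steadily until the end of training.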

Core Capabilities

Efficient document indexing from visual features
Multi-vector representations of text and images
Native mapping of image patches to text-compatible latent space
Zero-shot generalization potential to non-English languages

Frequently Asked Questions

Q: What makes this model unique?

ColPali's unique strength lies in its ability to create ColBERT-style multi-vector representations of both text and images, enabling more nuanced document retrieval through patch-level interactions.
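The patch-level interaction can be illustrated with ColBERT-style late-interaction scoring. Each query vector is matched to its most similar page vector, and the maxima are summed. The sketch below is pure Python with toy two-dimensional vectors; the function names, page names, and numbers are hypothetical. It is not the ColPali library's API.

```python
# ColBERT-style late interaction (MaxSim) on toy data.
# All names and vectors here are illustrative assumptions.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def late_interaction_score(query_vectors, page_vectors):
    """Sum, over query vectors, of the best dot product against any page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)

def rank_pages(query_vectors, pages):
    """Return (page_name, score) pairs sorted from best to worst match."""
    scores = {
        name: late_interaction_score(query_vectors, vectors)
        for name, vectors in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Two query token vectors and two pages with two patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],
    "page_b": [[0.5, 0.5], [0.4, 0.3]],
}
ranking = rank_pages(query, pages)
```

Here page_a ranks first because each query vector finds a closely aligned patch vector on that page. A single pooled embedding per page would hide this per-patch detail.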

Q: What are the recommended use cases?

The model excels in PDF document retrieval and processing, particularly for academic and technical documents in high-resource languages. It's especially useful for systems requiring precise document indexing and retrieval based on both visual and textual content.
