colpali

Maintained by: vidore

ColPali

Base Model: google/paligemma-3b-mix-448
License: MIT
Paper: ColPali: Efficient Document Retrieval with Vision Language Models
Primary Language: English

What is ColPali?

ColPali is a visual document retrieval model that extends the PaliGemma-3B vision-language model with ColBERT's multi-vector representation strategy. Its lineage starts from SigLIP, which was fine-tuned into BiSigLIP; SigLIP's patch embeddings are then fed to PaliGemma-3B. The result is a model that indexes and retrieves documents directly from their visual features, without a separate text-extraction step.

Implementation Details

The model passes image patch embeddings through a language model, which places text and image inputs in a shared latent space. It was trained on 127,460 query-page pairs drawn from academic datasets and synthetic data, using LoRA adapters with alpha=32 and r=32.
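To make the LoRA settings above concrete, the following is a minimal sketch of how a low-rank update is conventionally merged into a frozen weight matrix, assuming the standard LoRA scaling of alpha / r. With alpha=32 and r=32 that scaling factor is 1.0. The `LoraSettings` class and the helper functions are hypothetical names used for illustration only; they are not part of ColPali's codebase, and the toy matrices use rank 1 purely to keep the numbers readable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container for the LoRA hyperparameters quoted above."""
    r: int = 32
    alpha: int = 32

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.r


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, lora_b, lora_a, settings):
    """Return W + (alpha / r) * (B @ A) for plain nested-list matrices."""
    delta = matmul(lora_b, lora_a)
    s = settings.scaling
    return [
        [w_ij + s * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


settings = LoraSettings()

# Toy example: a 2x2 frozen weight and a rank-1 update (illustrative values).
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # shape (2, 1)
lora_a = [[0.5, 0.25]]    # shape (1, 2)

merged = lora_effective_weight(w, lora_b, lora_a, settings)
print(settings.scaling, merged)
```

Because alpha equals r here, the adapter's product B @ A is added to the frozen weights unscaled; choosing a different ratio would amplify or damp the adapter's contribution.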

  • Trained in bfloat16 on 8 GPUs in parallel
  • Uses the paged_adamw_8bit optimizer with a learning rate of 5e-5
  • Applies linear learning-rate decay with 2.5% warmup steps
  • Batch size of 32
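The learning-rate schedule listed above (linear decay with 2.5% warmup, peak 5e-5) can be sketched as a small function. This is an illustrative sketch only: the function name and the `total_steps` value of 4000 are assumptions, since the source does not state the number of training steps or how the warmup ramp is shaped. A linear ramp from zero is assumed here.

```python
def linear_warmup_decay_lr(step, total_steps, peak_lr=5e-5, warmup_fraction=0.025):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0.

    Assumes the warmup itself is linear and starts from zero.
    """
    warmup_steps = round(total_steps * warmup_fraction)
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# Hypothetical run length; 2.5% of 4000 steps gives 100 warmup steps.
TOTAL_STEPS = 4000

for step in (0, 50, 100, 2050, 4000):
    print(step, linear_warmup_decay_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate peaks at step 100 and reaches zero at the final step.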

Core Capabilities

  • Efficient document indexing from visual features
  • Multi-vector representations of text and images
  • Native mapping of image patches to text-compatible latent space
  • Zero-shot generalization potential to non-English languages

Frequently Asked Questions

Q: What makes this model unique?

ColPali produces ColBERT-style multi-vector representations of both text and images, so a query can be matched against individual image patches of a page rather than against a single pooled page vector.
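The patch-level matching described here is usually scored with ColBERT-style late interaction: for each query vector, take its best dot-product match among the page's patch vectors, then sum those maxima. The sketch below shows that scoring rule on toy 2-dimensional vectors. The function names, page names, and vector values are made up for illustration; real embeddings would come from the model, and this is not ColPali's own implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, page_vecs):
    """Sum over query vectors of the maximum similarity to any page vector."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return (page_name, score) pairs sorted from best to worst match."""
    scores = {
        name: late_interaction_score(query_vecs, vecs)
        for name, vecs in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: two query token vectors and two pages with a few patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "page_b": [[0.5, 0.5], [0.5, 0.5]],
}

ranking = rank_pages(query, pages)
print(ranking)
```

In this toy case page_a ranks first because each query vector finds an exact match among its patches, while page_b only offers partial matches.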

Q: What are the recommended use cases?

The model is best suited to retrieving pages from PDF documents, particularly academic and technical documents in high-resource languages. It fits systems that need to index and retrieve documents using both their visual and textual content.
