ColQwen2 v1.0

Maintained by vidore

License: MIT
Base Model: Qwen2-VL-2B-Instruct
Paper: ColPali: Efficient Document Retrieval with Vision Language Models
Language: English

What is colqwen2-v1.0?

ColQwen2 v1.0 is a visual retrieval model that combines the Qwen2-VL-2B-Instruct backbone with a ColBERT-style late-interaction strategy to index and retrieve documents from their visual features. Compared with earlier checkpoints, this version was trained with a larger batch size (256 instead of 32) and uses an updated pad token implementation.

Implementation Details

The model uses a dynamic image resolution approach, processing images without forced resizing or aspect ratio changes. It is capped at a maximum of 768 image patches per image, striking a balance between retrieval quality and memory efficiency. Training was conducted in bfloat16 with LoRA adapters (alpha=32, r=32) and the paged_adamw_8bit optimizer.

  • Trained on 127,460 query-page pairs
  • Uses low-rank adapters for transformer layers
  • Implements ColBERT-style multi-vector representations
  • Supports dynamic image resolutions
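The interplay between dynamic resolution and the 768-patch cap described above can be sketched as follows. The 28-pixel effective patch size and the aspect-ratio-preserving rescaling rule are illustrative assumptions, not values stated on this card:

```python
import math

MAX_PATCHES = 768   # cap stated on this card
PATCH_SIZE = 28     # assumed effective patch edge (hypothetical value for illustration)

def patch_grid(width: int, height: int) -> tuple[int, int]:
    """Patch columns/rows for an image kept at its native resolution."""
    return math.ceil(width / PATCH_SIZE), math.ceil(height / PATCH_SIZE)

def fit_to_budget(width: int, height: int) -> tuple[int, int]:
    """Downscale (preserving aspect ratio) only when the patch count exceeds the cap."""
    cols, rows = patch_grid(width, height)
    while cols * rows > MAX_PATCHES:
        scale = math.sqrt(MAX_PATCHES / (cols * rows))
        width, height = int(width * scale), int(height * scale)
        cols, rows = patch_grid(width, height)
    return width, height

# An A4-ish page scan is downscaled until its patch grid fits the budget
w, h = fit_to_budget(1240, 1754)
cols, rows = patch_grid(w, h)
assert cols * rows <= MAX_PATCHES
```

Small pages that already fit the budget pass through unchanged, so typical documents keep their native resolution and aspect ratio.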

Core Capabilities

  • Efficient document indexing from visual features
  • Multi-vector representation generation
  • Dynamic image resolution processing
  • Zero-shot generalization potential to non-English languages
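The ColBERT-style multi-vector scoring behind these capabilities can be sketched in NumPy. The embedding shapes below are made up for illustration; in practice the model emits one vector per query token and one per image patch:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score: for each query-token vector, take the
    maximum dot product over all page-patch vectors, then sum over tokens."""
    sim = query_emb @ page_emb.T          # (num_query_tokens, num_patches)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.standard_normal((12, 128))                        # 12 query tokens
pages = [rng.standard_normal((700, 128)) for _ in range(3)]   # 3 indexed pages

# Rank indexed pages by MaxSim score, best match first
ranking = sorted(range(len(pages)),
                 key=lambda i: maxsim_score(query, pages[i]),
                 reverse=True)
```

Because page embeddings can be precomputed offline, query-time cost is limited to one matrix product and a max-reduction per page, which is what makes this indexing scheme efficient.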

Frequently Asked Questions

Q: What makes this model unique?

The combination of the Qwen2-VL architecture with a ColBERT-style late-interaction strategy, together with dynamic image resolution handling and multi-vector representations, sets it apart in the document retrieval space.

Q: What are the recommended use cases?

The model is particularly well-suited for PDF-type document retrieval tasks, academic research, and applications requiring efficient visual-textual document indexing and retrieval.
