InternVL2_5-4B-AWQ


AWQ-quantized version of InternVL2_5-4B, optimized for vision-language tasks with minimal performance loss (82.3% on MMBench). Supports multi-modal chat and video analysis.

Model Size: 4B parameters (quantized)
Model Type: Multi-modal Vision-Language Model
Quantization: AWQ (Activation-aware Weight Quantization)
Hugging Face: rootonchair/InternVL2_5-4B-AWQ

What is InternVL2_5-4B-AWQ?

InternVL2_5-4B-AWQ is a quantized version of the original InternVL2_5-4B model, optimized using AWQ (Activation-aware Weight Quantization) technology. This model maintains impressive performance metrics, achieving 82.3% on MMBench_DEV_EN and 80.5% on OCRBench, demonstrating minimal degradation compared to the original model's performance.

Implementation Details

The model leverages advanced quantization techniques while maintaining compatibility with the Transformers library (requires version ≥4.37.2). It supports various deployment configurations including 16-bit precision, 8-bit quantization, and multi-GPU inference, making it highly versatile for different computational requirements.

  • Supports dynamic image preprocessing with adaptive tiling
  • Implements efficient multi-GPU distribution for large-scale deployment
  • Features Flash Attention optimization for improved performance
  • Enables both single and multi-image processing capabilities
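The adaptive tiling mentioned above can be thought of as a grid-selection problem: split the image into the tile grid whose aspect ratio best matches the input. The sketch below is a simplified illustration, not the model's exact preprocessing code; the function name and the tie-breaking rule (prefer fewer tiles) are assumptions.

```python
def choose_tile_grid(width, height, min_tiles=1, max_tiles=12):
    """Pick a (cols, rows) tile grid whose aspect ratio best matches the image.

    Simplified sketch of adaptive tiling; ties go to the smaller grid.
    """
    aspect = width / height
    candidates = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    # Primary key: closeness of grid aspect ratio to image aspect ratio.
    # Secondary key: total tile count, so ties favor the cheaper grid.
    return min(candidates, key=lambda g: (abs(aspect - g[0] / g[1]), g[0] * g[1]))
```

A wide 800x400 image maps to a 2x1 grid, a square image to a single tile; each tile is then resized to the vision encoder's input resolution and encoded separately.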

Core Capabilities

  • Pure text conversation with context awareness
  • Single-image and multi-image analysis
  • Video frame analysis and interpretation
  • Multi-round conversations with visual context
  • Batch inference processing for improved throughput
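For video analysis, frames are typically subsampled before being passed to the model as a multi-image input. Below is a hedged sketch of uniform frame sampling; the helper name is illustrative and the model's actual video preprocessing may differ.

```python
def sample_frame_indices(num_frames, total_frames):
    # Split the video into num_frames equal-length segments and
    # take the middle frame index of each segment.
    step = total_frames / num_frames
    return [int(step / 2 + step * i) for i in range(num_frames)]
```

Each sampled frame is then tiled and encoded like a still image, and the resulting visual tokens are concatenated into the conversation's visual context.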

Frequently Asked Questions

Q: What makes this model unique?

The model stands out for its efficient quantization that maintains high performance while reducing computational requirements. It achieves this through AWQ technology, making it more accessible for deployment while preserving the core capabilities of the original model.
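To make the idea concrete, here is a deliberately simplified illustration of symmetric 4-bit weight quantization with one floating-point scale per row of weights. This is not the actual AWQ algorithm, which additionally rescales salient channels based on activation statistics before quantizing; it only shows why low-bit integer storage plus a float scale preserves most of the weight information.

```python
def quantize_int4(row):
    # Symmetric 4-bit quantization: map each float to an integer
    # in [-7, 7] using a single per-row floating-point scale.
    scale = max(abs(x) for x in row) / 7.0
    return [max(-7, min(7, round(x / scale))) for x in row], scale

def dequantize(q, scale):
    # Recover approximate float weights from the stored integers.
    return [v * scale for v in q]

weights = [0.7, -1.4, 0.02, 0.35]
q, s = quantize_int4(weights)
approx = dequantize(q, s)
```

The reconstruction error per weight is bounded by half the scale, which is why a well-chosen scale (and, in real AWQ, activation-aware channel scaling) keeps benchmark scores close to the full-precision model.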

Q: What are the recommended use cases?

The model excels in various scenarios including image description, visual question answering, multi-image comparison, and video analysis. It's particularly suitable for applications requiring efficient deployment while maintaining high-quality vision-language capabilities.
