Kimi-VL-A3B-Instruct

Maintained By: moonshotai

Property                Value
Total Parameters        16B
Activated Parameters    2.8B
Context Length          128K tokens
Model Type              Vision-Language Model (MoE)
Paper                   arXiv:2504.07491
Model Hub               Hugging Face

What is Kimi-VL-A3B-Instruct?

Kimi-VL-A3B-Instruct is an efficient Mixture-of-Experts (MoE) vision-language model that delivers advanced multimodal capabilities while activating only 2.8B parameters. Built with a native-resolution visual encoder (MoonViT) and an MLP projector, it achieves state-of-the-art performance across a range of challenging tasks while maintaining computational efficiency.

Implementation Details

The model architecture combines an MoE language model with the MoonViT visual encoder, enabling efficient processing of high-resolution images and extended context lengths. It holds 16B parameters in total but activates only 2.8B of them during inference, making it highly efficient to deploy.

  • Extended 128K context window for processing long inputs
  • Native-resolution vision processing capability
  • Recommended sampling temperature of 0.2 for optimal performance (see the usage sketch after this list)
  • Supports multiple input types including images, videos, and documents
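
The snippet below is a minimal usage sketch, assuming the standard Hugging Face transformers flow for custom multimodal checkpoints (AutoModelForCausalLM / AutoProcessor with trust_remote_code=True) and the repo ID moonshotai/Kimi-VL-A3B-Instruct; the message schema, file names, and preprocessing details are illustrative assumptions, so consult the model card on Hugging Face for the authoritative example. It applies the recommended temperature of 0.2 during generation.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed Hugging Face repo ID; verify against the model hub listing.
MODEL_ID = "moonshotai/Kimi-VL-A3B-Instruct"

# trust_remote_code=True loads the custom MoE + MoonViT modeling code shipped
# with the checkpoint; device_map="auto" places the weights across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# Single image plus a question, formatted with the processor's chat template.
image_path = "example_document.png"  # hypothetical local file
image = Image.open(image_path)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Summarize the key figures in this document."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Sample with the recommended temperature of 0.2.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)

# Strip the prompt tokens before decoding the model's answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```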

Core Capabilities

  • Strong performance in college-level comprehension tasks (57.0 on MMMU-Val)
  • Superior OCR capabilities (83.2 on InfoVQA, 867 on OCRBench)
  • Excellent agent interaction performance (92.8 on ScreenSpot-V2)
  • Advanced video understanding (64.5 on LongVideoBench)
  • Robust multi-image processing and mathematical reasoning

Frequently Asked Questions

Q: What makes this model unique?

Its MoE architecture allows for impressive performance while activating only 2.8B parameters, making it highly efficient compared to dense models. The native-resolution vision encoder and extended context window set it apart in handling complex visual and textual inputs.

Q: What are the recommended use cases?

The model excels in general multimodal perception, OCR, long video and document processing, and agent-based tasks. It's particularly well-suited for applications requiring efficient processing of high-resolution visual inputs and extended context understanding.
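
To illustrate the multi-image and document use cases above, the sketch below reuses the `model` and `processor` from the earlier loading example and passes several pages in a single request; the page file names and prompt are hypothetical, and the same chat-template assumptions apply.

```python
# Reuses `model` and `processor` from the loading sketch above.
from PIL import Image

pages = ["report_page_1.png", "report_page_2.png"]  # hypothetical scanned pages
images = [Image.open(p) for p in pages]

# One user turn containing all pages, followed by the question.
messages = [
    {
        "role": "user",
        "content": [{"type": "image", "image": p} for p in pages]
        + [{"type": "text", "text": "Extract the totals table from these pages and compare the two."}],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=images, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.2)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```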
