Kimi-VL-A3B-Instruct

by moonshotai

An efficient 16B-parameter MoE vision-language model with only 2.8B active parameters, featuring a 128K context window and strong performance on multimodal tasks, OCR, and agent tasks.

Property              Value
Total Parameters      16B
Activated Parameters  2.8B
Context Length        128K tokens
Model Type            Vision-Language Model (MoE)
Paper                 arXiv:2504.07491
Model Hub             Hugging Face

What is Kimi-VL-A3B-Instruct?

Kimi-VL-A3B-Instruct is an efficient Mixture-of-Experts (MoE) vision-language model that offers advanced multimodal capabilities while activating only 2.8B of its 16B parameters per inference step. Built with a native-resolution visual encoder (MoonViT) and an MLP projector that maps vision features into the language model, it achieves strong, in some cases state-of-the-art, results on challenging benchmarks while maintaining computational efficiency.

Implementation Details

The architecture pairs an MoE language model with the MoonViT visual encoder, enabling efficient processing of high-resolution images and extended context lengths. Although the model holds 16B parameters in total, only 2.8B are activated per token during inference, which keeps deployment costs low. A minimal loading and inference sketch follows the feature list below.

  • Extended 128K context window for processing long inputs
  • Native-resolution vision processing capability
  • Recommended temperature setting of 0.2 for optimal performance
  • Supports multiple input types including images, videos, and documents
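
As referenced above, the model can be loaded through Hugging Face transformers with remote code enabled. The sketch below follows the standard transformers pattern for remote-code vision-language models; the exact message schema, processor behavior, and generation flags are defined by the model files on the hub, so treat these details as illustrative assumptions rather than the official recipe. The image path is a placeholder.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Instruct"

# trust_remote_code pulls the custom Kimi-VL modeling/processing
# code from the Hugging Face hub.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Placeholder image for illustration.
image_path = "demo.png"
image = Image.open(image_path)

# Assumed multimodal message schema: one user turn with an image
# part followed by a text part.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template to a prompt string, then batch the
# text and image together through the processor.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt", padding=True).to(model.device)

# The list above recommends a sampling temperature of 0.2.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)

# Strip the prompt tokens before decoding the response.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

Note that `do_sample=True` is required for the temperature to take effect; under greedy decoding the recommended 0.2 setting would simply be ignored.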

Core Capabilities

  • Strong performance in college-level comprehension tasks (57.0 on MMMU-Val)
  • Superior OCR capabilities (83.2 on InfoVQA, 867 on OCRBench)
  • Excellent agent interaction performance (92.8 on ScreenSpot-V2)
  • Advanced video understanding (64.5 on LongVideoBench)
  • Robust multi-image processing and mathematical reasoning

Frequently Asked Questions

Q: What makes this model unique?

Its MoE architecture delivers strong performance while activating only 2.8B parameters per token, making it far cheaper to run than dense models of comparable capability. The native-resolution vision encoder and 128K context window further set it apart in handling complex visual and long textual inputs.

Q: What are the recommended use cases?

The model excels in general multimodal perception, OCR, long video and document processing, and agent-based tasks. It's particularly well-suited for applications requiring efficient processing of high-resolution visual inputs and extended context understanding.
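
Because the 128K window leaves room for many images, a long document can be passed as a sequence of page renders in a single turn. The sketch below reuses the `model` and `processor` from the loading example above and assumes the same multimodal message schema; the file names are hypothetical placeholders.

```python
from PIL import Image

# `model` and `processor` come from the loading sketch above.
# Hypothetical page renders of a long document.
paths = ["page_1.png", "page_2.png", "page_3.png"]
images = [Image.open(p) for p in paths]

# One user turn carrying several images followed by the question,
# mirroring the single-image message structure used earlier.
messages = [
    {
        "role": "user",
        "content": [
            *({"type": "image", "image": p} for p in paths),
            {"type": "text", "text": "Summarize these document pages."},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=images, text=text, return_tensors="pt", padding=True).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)

trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```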
