# Kimi-VL-A3B-Instruct
| Property | Value |
|---|---|
| Total Parameters | 16B |
| Activated Parameters | 2.8B |
| Context Length | 128K tokens |
| Model Type | Vision-Language Model (MoE) |
| Paper | arXiv:2504.07491 |
| Model Hub | Hugging Face |
## What is Kimi-VL-A3B-Instruct?
Kimi-VL-A3B-Instruct is an efficient Mixture-of-Experts (MoE) vision-language model that delivers advanced multimodal capabilities while activating only 2.8B parameters per forward pass. Built around a native-resolution visual encoder (MoonViT) and an MLP projector, it achieves strong performance across a range of challenging tasks while remaining computationally efficient.
## Implementation Details
The architecture pairs an MoE language model with the MoonViT visual encoder, enabling efficient processing of high-resolution images and extended contexts. The model holds 16B parameters in total but activates only 2.8B during inference, keeping it inexpensive to deploy. A minimal usage sketch follows the feature list below.
- Extended 128K context window for processing long inputs
- Native-resolution vision processing capability
- Recommended temperature setting of 0.2 for optimal performance
- Supports multiple input types including images, videos, and documents
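As a rough illustration of how such a checkpoint is typically loaded, here is a minimal sketch, not the official snippet. It assumes the model follows the standard Hugging Face remote-code pattern (`AutoModelForCausalLM` plus `AutoProcessor`), that the repo id is `moonshotai/Kimi-VL-A3B-Instruct`, and that the chat-message layout matches common VLM conventions; defer to the model card on the Hub for the exact, supported usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Instruct"  # assumed repo id
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # pick bf16/fp16 automatically where available
    device_map="auto",       # spread weights across available devices
    trust_remote_code=True,  # MoE + MoonViT code ships with the checkpoint
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.png")  # hypothetical input file
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "example.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Render the chat template to a prompt string, then batch text + pixels together.
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# Temperature 0.2 follows the recommendation in the list above.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
# Decode only the newly generated tokens, not the echoed prompt.
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

`device_map="auto"` and `torch_dtype="auto"` are conveniences, not requirements; on a single GPU you can instead move the model explicitly and pick a dtype by hand.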
## Core Capabilities
- Strong performance in college-level comprehension tasks (57.0 on MMMU-Val)
- Superior OCR capabilities (83.2 on InfoVQA, 867 on OCRBench)
- Excellent agent interaction performance (92.8 on ScreenSpot-V2)
- Advanced video understanding (64.5 on LongVideoBench)
- Robust multi-image processing and mathematical reasoning (see the multi-image sketch after this list)
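To make the multi-image point concrete, here is a hedged sketch of a two-image prompt. It reuses the `model` and `processor` objects from the snippet above, and the message layout is again an assumption based on common VLM chat conventions rather than documented Kimi-VL behavior.

```python
from PIL import Image

# Hypothetical input files for a page-comparison task.
images = [Image.open("page1.png"), Image.open("page2.png")]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "page1.png"},
            {"type": "image", "image": "page2.png"},
            {"type": "text", "text": "Compare these two document pages and summarize the differences."},
        ],
    }
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
# Pass all images in one batch alongside the rendered prompt.
inputs = processor(images=images, text=text, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.2)
```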
## Frequently Asked Questions
Q: What makes this model unique?
Its MoE architecture activates only 2.8B of its 16B total parameters (about 17.5%) per forward pass, making it far cheaper to run than a comparably capable dense model. The native-resolution vision encoder and 128K context window further set it apart in handling complex visual and textual inputs.
Q: What are the recommended use cases?
The model excels in general multimodal perception, OCR, long video and document processing, and agent-based tasks. It's particularly well-suited for applications requiring efficient processing of high-resolution visual inputs and extended context understanding.