Yi-VL-6B

01-ai

Yi-VL-6B is a bilingual (English and Chinese) vision-language model for high-resolution image understanding and multi-round conversations. It is built on the LLaVA architecture, with a CLIP ViT-H/14 vision encoder and the Yi-6B-Chat language model.

Property        Value
License         Apache 2.0
Research Paper  Yi: Open Foundation Models
Architecture    LLaVA with CLIP ViT-H/14
Resolution      448x448

What is Yi-VL-6B?

Yi-VL-6B is a vision-language model that pairs image understanding with language generation. Built on the LLaVA architecture, it connects a CLIP ViT-H/14 vision transformer to the Yi-6B-Chat language model so that it can describe images in detail and converse about them in both English and Chinese.

Implementation Details

The model has three components: a Vision Transformer (ViT) that encodes the image, a projection module that aligns image features with the language model's feature space, and a large language model that processes text. Training ran in three stages that focused, in turn, on image-text alignment, high-resolution processing, and multimodal conversation.
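The data flow through the three components can be sketched in a few lines of plain Python. This is a toy illustration, not the model's code: the function names (`encode_image`, `project`, `embed_text`, `build_llm_input`), the tiny dimensions, and the single linear projection are all assumptions made for readability.

```python
import random

# Toy sizes chosen for illustration only; they are NOT the real model dimensions.
NUM_PATCHES = 4   # number of image patches the vision encoder emits
VISION_DIM = 3    # width of each patch feature
LLM_DIM = 5       # width of the language model's token embeddings


def encode_image(image_id):
    """Stand-in for the ViT: returns one feature vector per image patch."""
    rng = random.Random(image_id)  # deterministic fake features per image
    return [[rng.uniform(-1, 1) for _ in range(VISION_DIM)]
            for _ in range(NUM_PATCHES)]


def project(features, weights):
    """Stand-in for the projection module.

    Assumption: a single linear map from vision width to LLM width,
    used here as the simplest possible form of feature alignment.
    """
    return [
        [sum(f[i] * weights[i][j] for i in range(VISION_DIM))
         for j in range(LLM_DIM)]
        for f in features
    ]


def embed_text(tokens):
    """Stand-in for the LLM embedding table: one LLM_DIM vector per token."""
    return [[float(len(tok))] * LLM_DIM for tok in tokens]


def build_llm_input(image_id, prompt_tokens, weights):
    """Concatenate projected image features with text embeddings."""
    image_embeddings = project(encode_image(image_id), weights)
    text_embeddings = embed_text(prompt_tokens)
    # The language model then processes image and text positions as one sequence.
    return image_embeddings + text_embeddings


weights = [[0.1 * (i + j) for j in range(LLM_DIM)] for i in range(VISION_DIM)]
sequence = build_llm_input("cat.jpg", ["What", "is", "this", "?"], weights)
print(len(sequence), len(sequence[0]))  # 8 positions, each of width 5
```

The point of the sketch is the shape change: patch features leave the encoder at the vision width, and the projection brings them to the language model's width so both kinds of input can share one sequence.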

  • Supports high-resolution image processing (448x448)
  • Trained on 128 NVIDIA A800 GPUs
  • Completed training in approximately 3 days
  • Trained on datasets including LAION-400M, CLLaVA, and several visual question-answering datasets
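Two of the figures above lend themselves to quick arithmetic. The sketch below computes how many patches a ViT with 14-pixel patches (the "/14" in ViT-H/14) produces for a square image, and the approximate GPU-hours implied by the stated hardware and duration. The 224-pixel comparison size is an assumption added for contrast, and the GPU-hour figure is only as precise as "approximately 3 days".

```python
def patch_count(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    per_side = image_size // patch_size
    return per_side * per_side


PATCH_SIZE = 14  # from the model name ViT-H/14

high_res = patch_count(448, PATCH_SIZE)  # 32 * 32 patches
low_res = patch_count(224, PATCH_SIZE)   # 16 * 16 patches (comparison size, assumed)

print(high_res)             # 1024
print(low_res)              # 256
print(high_res // low_res)  # 4

# Rough training cost from the figures listed above: 128 GPUs for about 3 days.
gpu_hours = 128 * 3 * 24
print(gpu_hours)            # 9216
```

Doubling the side length quadruples the number of patches, which is why higher input resolution is more expensive to process.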

Core Capabilities

  • Multi-round text-image conversations about a single input image
  • Bilingual support for English and Chinese
  • Advanced image comprehension and information extraction
  • Fine-grained visual detail recognition
  • Text recognition in images
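The "multi-round, single image" pattern can be made concrete with a small data structure. The class below is a hypothetical container, not part of any Yi-VL API: it fixes one image for the whole conversation and accumulates alternating user and assistant turns. The replies are placeholders, since no model is called.

```python
from dataclasses import dataclass, field


@dataclass
class ImageConversation:
    """Hypothetical container: one image, any number of text turns."""
    image_path: str
    turns: list = field(default_factory=list)  # list of (role, text) tuples

    def ask(self, question: str) -> None:
        """Record a user question about the conversation's image."""
        self.turns.append(("user", question))

    def answer(self, reply: str) -> None:
        """Record a model reply; it must follow a user question."""
        if not self.turns or self.turns[-1][0] != "user":
            raise ValueError("an answer must follow a user question")
        self.turns.append(("assistant", reply))

    def rounds(self) -> int:
        """Number of completed question-answer rounds."""
        return sum(1 for role, _ in self.turns if role == "assistant")


chat = ImageConversation("menu.jpg")
chat.ask("What dishes are listed?")
chat.answer("(model reply would go here)")
chat.ask("这道菜多少钱?")  # a follow-up in Chinese about the same image
chat.answer("(model reply would go here)")
print(chat.rounds())  # 2
```

Because the image is set once at construction, every later turn refers to the same picture, which mirrors the single-image constraint in the list above.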

Frequently Asked Questions

Q: What makes this model unique?

Yi-VL-6B combines bilingual (English and Chinese) conversation with 448x448 image input. This suits it to detailed visual analysis and to natural conversations about images in either language.

Q: What are the recommended use cases?

The model is suited to visual question answering, image content analysis, multilingual image-based conversations, and detailed visual information extraction. It is a good fit for applications that need fine-grained image understanding combined with natural language interaction.
