Yi-VL-6B

Maintained by: 01-ai

Property | Value
License | Apache 2.0
Research Paper | Yi: Open Foundation Models
Architecture | LLaVA with CLIP ViT-H/14
Resolution | 448x448

What is Yi-VL-6B?

Yi-VL-6B is a vision-language model that pairs image understanding with conversational text generation. Built on the LLaVA architecture, it connects a CLIP ViT-H/14 vision transformer to the Yi-6B-Chat language model, enabling detailed image comprehension and conversation in both English and Chinese.

Implementation Details

The model has three components: a Vision Transformer (ViT) that encodes the image, a projection module that aligns image features with the language model's text feature space, and a large language model that processes text and generates the response. Training proceeded in three stages, covering image-text alignment, high-resolution processing, and multimodal conversation, in that order.
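The data flow through the three components can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the model's actual code: the function names, the tiny dimensions, and the use of a single linear map as the projection are all hypothetical, and random numbers stand in for real ViT features and text embeddings.

```python
import random

def encode_image(num_patches, vit_dim, seed=0):
    """Stand-in for the ViT: one random feature vector per image patch."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(vit_dim)] for _ in range(num_patches)]

def project(features, weight):
    """Stand-in for the projection module: map each ViT feature to the LLM width.

    `weight` has shape (vit_dim, llm_dim). A single linear map is an
    assumption made for brevity; the real module may be deeper.
    """
    llm_dim = len(weight[0])
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) for j in range(llm_dim)]
        for f in features
    ]

def build_llm_input(image_tokens, text_tokens):
    """Place the projected image tokens in front of the text embeddings."""
    return image_tokens + text_tokens

# Toy sizes (hypothetical, chosen only to keep the example fast).
VIT_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT = 8, 4, 6, 3

rng = random.Random(1)
weight = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(VIT_DIM)]
text_tokens = [[0.0] * LLM_DIM for _ in range(NUM_TEXT)]  # placeholder embeddings

image_features = encode_image(NUM_PATCHES, VIT_DIM)
image_tokens = project(image_features, weight)
llm_input = build_llm_input(image_tokens, text_tokens)

print(len(llm_input), len(llm_input[0]))  # 9 4
```

The point of the sketch is the shape bookkeeping: after projection, every image patch has the same width as a text token, so the language model can consume both in one sequence.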

  • Processes images at 448x448 resolution
  • Trained on 128 NVIDIA A800 GPUs
  • Completed training in approximately 3 days
  • Trained on datasets including LAION-400M, CLLaVA, and several visual question-answering datasets
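The figures in the list above lend themselves to a quick back-of-the-envelope check. The sketch below assumes a patch size of 14 pixels (read from the "/14" in ViT-H/14) and non-overlapping patches; the 224x224 comparison point is a common CLIP input size used here only for contrast, and the GPU-hour figure treats "approximately 3 days" as exactly 72 hours, so it is a rough estimate.

```python
def patches_per_image(image_size: int, patch_size: int = 14) -> int:
    """Number of non-overlapping square patches a ViT cuts a square image into."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side

high_res = patches_per_image(448)  # 32 * 32 patches
low_res = patches_per_image(224)   # 16 * 16 patches, for comparison only

GPUS = 128
DAYS = 3  # "approximately 3 days", rounded for the estimate
gpu_hours = GPUS * DAYS * 24

print(high_res, low_res, high_res // low_res, gpu_hours)  # 1024 256 4 9216
```

Doubling the side length quadruples the number of patches, which is why high-resolution input is noticeably more expensive for the vision encoder.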

Core Capabilities

  • Multi-round text-image conversations about a single input image
  • Bilingual support for English and Chinese
  • Advanced image comprehension and information extraction
  • Fine-grained visual detail recognition
  • Text recognition in images
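The "multi-round, single image" constraint can be made concrete with a small conversation container. This is an illustrative sketch only: the `Conversation` class, the `<image>` placeholder, and the `user`/`assistant` role labels are hypothetical and do not reflect the model's real prompt template, and the reply text is a hand-written stand-in rather than model output.

```python
from dataclasses import dataclass, field

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker, not the model's real token

@dataclass
class Conversation:
    image_path: str                            # the single image for the whole chat
    turns: list = field(default_factory=list)  # (role, text) pairs in order

    def ask(self, question: str) -> None:
        if self.turns and self.turns[-1][0] == "user":
            raise ValueError("previous question has not been answered yet")
        if IMAGE_PLACEHOLDER in question:
            raise ValueError("only one image per conversation is supported")
        # The image is attached to the first user turn only.
        text = f"{IMAGE_PLACEHOLDER}\n{question}" if not self.turns else question
        self.turns.append(("user", text))

    def reply(self, answer: str) -> None:
        if not self.turns or self.turns[-1][0] != "user":
            raise ValueError("no pending question to answer")
        self.turns.append(("assistant", answer))

    def render(self) -> str:
        """Flatten the turns into one prompt string."""
        return "\n".join(f"{role}: {text}" for role, text in self.turns)

chat = Conversation(image_path="cat.jpg")
chat.ask("What animal is in the picture?")
chat.reply("A cat sitting on a sofa.")  # stand-in for a model reply
chat.ask("What color is the sofa?")
print(chat.render())
```

Later questions reuse the image from the first turn, so the rendered prompt contains the placeholder exactly once no matter how many rounds follow.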

Frequently Asked Questions

Q: What makes this model unique?

Yi-VL-6B combines bilingual support (English and Chinese) with 448x448 image input, which makes it well suited to detailed visual analysis and to conversations about images in either language.

Q: What are the recommended use cases?

The model is suited to visual question answering, image content analysis, image-based conversations in English and Chinese, and detailed visual information extraction. It fits applications that need both image understanding and natural-language interaction.
