Yi-VL-34B
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Architecture | LLaVA-based with CLIP ViT-H/14 |
| Research Paper | Yi: Open Foundation Models by 01.AI |
| Base LLM | Yi-34B-Chat |
What is Yi-VL-34B?
Yi-VL-34B is the world's first open-source 34B vision language model, designed for advanced image understanding and bilingual conversation. Built by 01-ai, it pairs a CLIP ViT-H/14 vision transformer with the Yi-34B-Chat language model to enable sophisticated image-text interactions.
Implementation Details
The model uses a three-component architecture: a CLIP ViT-H/14 vision encoder, a projection module that aligns image features with the LLM embedding space, and the Yi-34B-Chat LLM (a minimal sketch follows the list below). It accepts high-resolution image input at 448×448 and was trained in three stages on over 100 million image-text pairs.
- Multi-stage training on diverse datasets including LAION-400M, CLLaVA, and specialized visual datasets
- Trained using 128 NVIDIA A800 GPUs over approximately 10 days
- Bilingual capability covering both English and Chinese
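To make the encode → project → generate flow concrete, here is a minimal PyTorch sketch of the three-component pipeline. It is illustrative only: the hidden sizes (1280 for ViT-H/14, 7168 for Yi-34B), the single transformer layer standing in for the full vision encoder, and the two-layer MLP projector are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the three components described above; the real model
# uses CLIP ViT-H/14 weights for vision and Yi-34B-Chat as the language model.

class ProjectionModule(nn.Module):
    """Aligns vision features with the LLM embedding space (two-layer MLP assumed)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class VisionLanguageSketch(nn.Module):
    """Toy encode -> project pipeline; the output tokens would be fed to the LLM."""
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 7168):
        super().__init__()
        # A 14x14 patch embedding mirrors the ViT-H/14 patch size;
        # a 448x448 image yields a 32x32 grid of patch tokens.
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=14, stride=14)
        self.encoder = nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True)
        self.projector = ProjectionModule(vision_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(image)            # (B, vision_dim, 32, 32)
        tokens = patches.flatten(2).transpose(1, 2)  # (B, 1024, vision_dim)
        vision_features = self.encoder(tokens)       # single ViT-style layer as a stand-in
        return self.projector(vision_features)       # (B, 1024, llm_dim), ready for the LLM


if __name__ == "__main__":
    sketch = VisionLanguageSketch()
    image_tokens = sketch(torch.randn(1, 3, 448, 448))
    print(image_tokens.shape)  # torch.Size([1, 1024, 7168])
```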
Core Capabilities
- Multi-round text-image conversations with a single image input (see the preprocessing and prompt sketch after this list)
- High-resolution image understanding at 448×448
- Leading results among open-source models on the MMMU and CMMMU benchmarks at release
- Strong bilingual support for English and Chinese
- Advanced visual information extraction and summarization
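As a rough illustration of how a client might prepare inputs for a multi-round conversation, the sketch below resizes an image to 448×448 and assembles a LLaVA-style chat prompt. The CLIP normalization statistics, the `<image_placeholder>` token, and the `### Human:` / `### Assistant:` turn markers are assumptions for demonstration; the official Yi-VL inference code defines the exact template and preprocessing.

```python
from PIL import Image
from torchvision import transforms

# Resize and normalize to the model's 448x448 input resolution.
# The mean/std values are the standard OpenAI CLIP statistics (assumed here).
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

def build_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Assemble a multi-round prompt; the image placeholder goes in the first human turn.
    Turn markers and the placeholder token are hypothetical, LLaVA-style conventions."""
    turns = []
    for i, (user, assistant) in enumerate(history):
        prefix = "<image_placeholder>\n" if i == 0 else ""
        turns.append(f"### Human: {prefix}{user}\n### Assistant: {assistant}")
    prefix = "<image_placeholder>\n" if not history else ""
    turns.append(f"### Human: {prefix}{question}\n### Assistant:")
    return "\n".join(turns)

if __name__ == "__main__":
    pixel_values = preprocess(Image.new("RGB", (1024, 768)))  # stand-in image
    print(pixel_values.shape)                                  # torch.Size([3, 448, 448])
    prompt = build_prompt([("What is in this picture?", "A street scene.")],
                          "这张图片里有多少辆车？")               # follow-up turn in Chinese
    print(prompt)
```

The same loop extends naturally to bilingual exchanges, since the underlying Yi-34B-Chat LLM handles both English and Chinese.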
Frequently Asked Questions
Q: What makes this model unique?
Yi-VL-34B is the first open-source 34B vision language model worldwide, offering strong bilingual capabilities and ranking first among open-source models on the MMMU and CMMMU benchmarks at release. Its high-resolution processing and comprehensive training make it particularly effective for detailed image analysis.
Q: What are the recommended use cases?
The model excels in visual question answering, image content analysis, bilingual image-based conversations, and detailed visual information extraction. It's particularly suitable for applications requiring sophisticated image understanding in both English and Chinese contexts.