Yi-VL-34B
| Property | Value |
|---|---|
| License | Apache 2.0 |
| Architecture | LLaVA-based with CLIP ViT-H/14 |
| Research Paper | Yi: Open Foundation Models by 01.AI |
| Base LLM | Yi-34B-Chat |
What is Yi-VL-34B?
Yi-VL-34B is the world's first open-source 34B vision language model, designed for advanced image understanding and bilingual conversation. Built by 01-ai, it pairs a CLIP ViT-H/14 vision transformer with the Yi-34B-Chat language model to enable sophisticated image-text interactions.
Implementation Details
The model uses a three-component architecture: a CLIP ViT-H/14 vision encoder, a projection module that aligns image features with the LLM embedding space, and the Yi-34B-Chat LLM (a minimal sketch follows the list below). It accepts high-resolution image input at 448×448 and was trained in three stages on over 100 million image-text pairs.
- Multi-stage training on diverse datasets including LAION-400M, CLLaVA, and specialized visual datasets
- Trained using 128 NVIDIA A800 GPUs over approximately 10 days
- Bilingual capability covering both English and Chinese
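To make the encode → project → generate flow concrete, here is a minimal PyTorch sketch of the three-component pipeline. It is illustrative only: the hidden sizes (1280 for ViT-H/14, 7168 for Yi-34B), the single transformer layer standing in for the full vision encoder, and the two-layer MLP projector are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the three components described above; the real model
# uses CLIP ViT-H/14 weights for vision and Yi-34B-Chat as the language model.

class ProjectionModule(nn.Module):
    """Aligns vision features with the LLM embedding space (two-layer MLP assumed)."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


class VisionLanguageSketch(nn.Module):
    """Toy encode -> project pipeline; the output tokens would be fed to the LLM."""
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 7168):
        super().__init__()
        # A 14x14 patch embedding mirrors the ViT-H/14 patch size;
        # a 448x448 image yields a 32x32 grid of patch tokens.
        self.patch_embed = nn.Conv2d(3, vision_dim, kernel_size=14, stride=14)
        self.encoder = nn.TransformerEncoderLayer(d_model=vision_dim, nhead=16, batch_first=True)
        self.projector = ProjectionModule(vision_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        patches = self.patch_embed(image)            # (B, vision_dim, 32, 32)
        tokens = patches.flatten(2).transpose(1, 2)  # (B, 1024, vision_dim)
        vision_features = self.encoder(tokens)       # single ViT-style layer as a stand-in
        return self.projector(vision_features)       # (B, 1024, llm_dim), ready for the LLM


if __name__ == "__main__":
    sketch = VisionLanguageSketch()
    image_tokens = sketch(torch.randn(1, 3, 448, 448))
    print(image_tokens.shape)  # torch.Size([1, 1024, 7168])
```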
Core Capabilities
- Multi-round text-image conversations with a single image input (see the preprocessing and prompt sketch after this list)
- High-resolution image understanding at 448×448
- Leading results among open-source models on the MMMU and CMMMU benchmarks at release
- Strong bilingual support for English and Chinese
- Advanced visual information extraction and summarization
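As a rough illustration of how a client might prepare inputs for a multi-round conversation, the sketch below resizes an image to 448×448 and assembles a LLaVA-style chat prompt. The CLIP normalization statistics, the `<image_placeholder>` token, and the `### Human:` / `### Assistant:` turn markers are assumptions for demonstration; the official Yi-VL inference code defines the exact template and preprocessing.

```python
from PIL import Image
from torchvision import transforms

# Resize and normalize to the model's 448x448 input resolution.
# The mean/std values are the standard OpenAI CLIP statistics (assumed here).
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])

def build_prompt(history: list[tuple[str, str]], question: str) -> str:
    """Assemble a multi-round prompt; the image placeholder goes in the first human turn.
    Turn markers and the placeholder token are hypothetical, LLaVA-style conventions."""
    turns = []
    for i, (user, assistant) in enumerate(history):
        prefix = "<image_placeholder>\n" if i == 0 else ""
        turns.append(f"### Human: {prefix}{user}\n### Assistant: {assistant}")
    prefix = "<image_placeholder>\n" if not history else ""
    turns.append(f"### Human: {prefix}{question}\n### Assistant:")
    return "\n".join(turns)

if __name__ == "__main__":
    pixel_values = preprocess(Image.new("RGB", (1024, 768)))  # stand-in image
    print(pixel_values.shape)                                  # torch.Size([3, 448, 448])
    prompt = build_prompt([("What is in this picture?", "A street scene.")],
                          "这张图片里有多少辆车？")               # follow-up turn in Chinese
    print(prompt)
```

The same loop extends naturally to bilingual exchanges, since the underlying Yi-34B-Chat LLM handles both English and Chinese.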
Frequently Asked Questions
Q: What makes this model unique?
Yi-VL-34B is the first open-source 34B vision language model worldwide, offering strong bilingual capabilities and ranking first among open-source models on the MMMU and CMMMU benchmarks at release. Its high-resolution processing and comprehensive training make it particularly effective for detailed image analysis.
Q: What are the recommended use cases?
The model excels in visual question answering, image content analysis, bilingual image-based conversations, and detailed visual information extraction. It's particularly suitable for applications requiring sophisticated image understanding in both English and Chinese contexts.