Yi-VL-34B

01-ai

Yi-VL-34B is a bilingual (English and Chinese) vision-language model with 34B parameters. It supports multi-round image-text conversations and, according to 01-ai, ranked first among open-source models on the MMMU and CMMMU benchmarks as of January 2024.

Property        Value
License         Apache 2.0
Architecture    LLaVA-based with CLIP ViT-H/14
Research Paper  Yi: Open Foundation Models
Base LLM        Yi-34B-Chat

What is Yi-VL-34B?

Yi-VL-34B is a vision language model built by 01-ai for image understanding and bilingual conversation; 01-ai describes it as the first open-source vision language model at the 34B scale. It pairs a CLIP Vision Transformer with the Yi-34B-Chat language model so that a user can ask questions about an image in English or Chinese.

Implementation Details

The model has three components: a CLIP ViT-H/14 vision encoder that turns the image into features, a projection module that aligns those features with the language model's feature space, and the Yi-34B-Chat LLM. It processes images at 448×448 resolution and was trained in three stages on more than 100 million image-text pairs.

  • Trained in multiple stages on datasets including LAION-400M, CLLaVA, and task-specific visual datasets
  • Trained on 128 NVIDIA A800 GPUs for approximately 10 days
  • Supports both English and Chinese
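
The data flow through the three components can be sketched by tracking tensor shapes only. This is a minimal illustration, not 01-ai's implementation: the patch size of 14 follows from the "/14" in ViT-H/14, while the vision width (1280) and LLM hidden size (7168) are assumed values, the class token is ignored, and the names `VisionConfig`, `num_patches`, and `pipeline_shapes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionConfig:
    image_size: int = 448    # input resolution stated in the text
    patch_size: int = 14     # the "/14" in ViT-H/14
    hidden_size: int = 1280  # assumption: typical ViT-H feature width


def num_patches(cfg: VisionConfig) -> int:
    """Number of image patches the vision encoder produces (class token ignored)."""
    if cfg.image_size % cfg.patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = cfg.image_size // cfg.patch_size
    return side * side


def pipeline_shapes(cfg: VisionConfig, llm_hidden_size: int = 7168,
                    text_tokens: int = 0) -> dict:
    """Shapes (sequence length, feature width) after each of the three components.

    llm_hidden_size is an assumed value for illustration only.
    """
    n = num_patches(cfg)
    return {
        "vision_encoder_out": (n, cfg.hidden_size),      # image -> patch features
        "projector_out": (n, llm_hidden_size),           # aligned to the LLM width
        "llm_input": (n + text_tokens, llm_hidden_size)  # image tokens + text tokens
    }


if __name__ == "__main__":
    for name, shape in pipeline_shapes(VisionConfig(), text_tokens=20).items():
        print(name, shape)
```

Under these assumptions a 448×448 image yields 32 × 32 = 1024 patch features, four times as many as a 224×224 input, which is one reason higher input resolution raises the cost of every conversation turn.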

Core Capabilities

  • Multi-round text-image conversations about a single input image
  • Image understanding at 448×448 input resolution
  • Ranked first among open-source models on the MMMU (English) and CMMMU (Chinese) benchmarks as of January 2024, according to 01-ai
  • Bilingual support for English and Chinese
  • Extraction and summarization of visual information
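
The "multi-round, single image" constraint can be illustrated with a small conversation container that attaches one image to the first user turn and lets later turns refer back to it. This is only a sketch: the `<image>` placeholder, the `user:` / `assistant:` line format, and the `Conversation` class are hypothetical and do not reproduce the model's actual prompt template.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker, not the official token


@dataclass
class Conversation:
    image_path: str  # exactly one image per conversation
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def add_user(self, text: str) -> None:
        if IMAGE_PLACEHOLDER in text:
            raise ValueError("the image is attached automatically to the first user turn")
        if self.turns and self.turns[-1][0] == "user":
            raise ValueError("expected an assistant reply before the next user turn")
        self.turns.append(("user", text))

    def add_assistant(self, text: str) -> None:
        if not self.turns or self.turns[-1][0] != "user":
            raise ValueError("an assistant reply must follow a user turn")
        self.turns.append(("assistant", text))

    def render(self) -> str:
        """Flatten the turns into one prompt string (illustrative format only)."""
        lines = []
        for index, (role, text) in enumerate(self.turns):
            if index == 0:
                # The single image is referenced once, in the first user turn.
                text = f"{IMAGE_PLACEHOLDER}\n{text}"
            lines.append(f"{role}: {text}")
        if self.turns and self.turns[-1][0] == "user":
            lines.append("assistant:")  # cue for the model's next reply
        return "\n".join(lines)


if __name__ == "__main__":
    conv = Conversation(image_path="example.jpg")  # hypothetical file name
    conv.add_user("What is shown in this picture?")
    conv.add_assistant("A cat sitting on a windowsill.")
    conv.add_user("请用中文描述一下这张图片。")
    print(conv.render())
```

Keeping the image reference in the first turn means follow-up questions, in either language, reuse the same visual context without re-sending the image.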

Frequently Asked Questions

Q: What makes this model unique?

01-ai describes Yi-VL-34B as the first open-source vision language model at the 34B scale. It handles both English and Chinese, and 01-ai reports that it ranked first among open-source models on the MMMU and CMMMU benchmarks as of January 2024. Its 448×448 input resolution and three-stage training on more than 100 million image-text pairs are aimed at detailed image analysis.

Q: What are the recommended use cases?

The model is suited to visual question answering, image content analysis, bilingual image-based conversations, and detailed visual information extraction. It is a particularly good fit for applications that need image understanding in both English and Chinese.
