chinese-clip-vit-base-patch16

OFA-Sys

Chinese CLIP model using ViT-B/16 image encoder and RoBERTa-wwm-base text encoder, trained on 200M Chinese image-text pairs for multimodal understanding and zero-shot classification.

| Property | Value |
|---|---|
| Author | OFA-Sys |
| Paper | arXiv:2211.01335 |
| Architecture | ViT-B/16 + RoBERTa-wwm-base |
| Training Data | 200M Chinese image-text pairs |

What is chinese-clip-vit-base-patch16?

Chinese CLIP is a multimodal model that aligns Chinese text with visual content through contrastive learning. It employs a ViT-B/16 architecture for image encoding and RoBERTa-wwm-base for text encoding, trained on an extensive dataset of 200 million Chinese image-text pairs.

Implementation Details

The model implements a dual-encoder architecture that processes images and text separately before computing similarity scores. It achieves state-of-the-art performance in various Chinese vision-language tasks, including zero-shot classification and cross-modal retrieval.

  • Utilizes Vision Transformer (ViT) architecture with 16x16 patch size
  • Implements contrastive learning approach similar to OpenAI's CLIP
  • Supports zero-shot classification capabilities
  • Provides both image and text feature extraction
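The contrastive, dual-encoder design above can be sketched in a few lines: both encoders map their input into a shared embedding space, embeddings are L2-normalized, and similarity is a temperature-scaled dot product turned into probabilities with a softmax. A minimal NumPy illustration (the 4-dimensional embeddings and temperature below are made-up placeholders, not actual model outputs; real Chinese CLIP embeddings are 512-dimensional):

```python
import numpy as np

def clip_similarity(image_emb, text_embs, temperature=0.07):
    """CLIP-style similarity: cosine similarity of L2-normalized
    embeddings, scaled by a temperature and softmax-normalized."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature  # one logit per text
    exp = np.exp(logits - logits.max())           # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings: the first text points almost the same way as the image,
# the second is orthogonal to it.
image = np.array([1.0, 0.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0]])
probs = clip_similarity(image, texts)
print(probs)  # the first text receives nearly all the probability mass
```

The low temperature (0.07, the value commonly used in CLIP-style training) sharpens the softmax, so small differences in cosine similarity translate into large differences in probability.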

Core Capabilities

  • Text-to-Image Retrieval (71.2% R@1 on Flickr30K-CN zero-shot)
  • Image-to-Text Retrieval (81.6% R@1 on Flickr30K-CN zero-shot)
  • Zero-shot Image Classification (96.0% on CIFAR10)
  • Cross-modal Similarity Computation
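Zero-shot classification with this model can be run through Hugging Face transformers' `ChineseCLIPModel` and `ChineseCLIPProcessor`: candidate class names are written as Chinese text prompts, and the image is assigned to whichever prompt scores highest. A hedged sketch, assuming transformers, torch, and Pillow are installed; the solid-color placeholder image and label set are illustrative, not from the model card:

```python
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-base-patch16"
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

# Placeholder image; in practice load a real photo with Image.open(...)
image = Image.new("RGB", (224, 224), color="red")
labels = ["一只猫", "一只狗", "一辆汽车"]  # "a cat", "a dog", "a car"

inputs = processor(text=labels, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# One row of logits per image; softmax over the candidate labels
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # shape (1, 3): a probability per label
```

Because classification is just text-image matching, swapping in a different label list requires no retraining, which is what makes the zero-shot setting possible.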

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically designed for Chinese language understanding and vision tasks, offering superior performance compared to previous Chinese multimodal models. It achieves significant improvements in zero-shot capabilities and cross-modal retrieval tasks.

Q: What are the recommended use cases?

The model excels in image-text matching, zero-shot image classification, and cross-modal retrieval tasks in Chinese. It's particularly suitable for applications like content recommendation, visual search, and automatic image captioning in Chinese language contexts.
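For retrieval use cases like visual search, the pattern is to precompute embeddings for a gallery of images, embed the text query, and rank the gallery by cosine similarity. A minimal sketch with hypothetical 3-dimensional embeddings standing in for real model outputs:

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=2):
    """Rank gallery embeddings by cosine similarity to the query
    and return the indices of the top-k matches."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity per gallery item
    return np.argsort(-sims)[:k]  # indices sorted by descending similarity

# Toy gallery of 3 "image" embeddings; index 2 points closest to the query.
gallery = np.array([[0.0, 1.0, 0.0],
                    [0.5, 0.5, 0.0],
                    [1.0, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])
top = retrieve_top_k(query, gallery, k=2)
print(top)  # → [2 1]
```

At production scale the gallery embeddings would typically live in an approximate-nearest-neighbor index rather than a dense matrix, but the ranking criterion is the same.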
