chinese-clip-vit-base-patch16

Maintained By
OFA-Sys

Chinese CLIP ViT-Base-Patch16

Property        Value
Author          OFA-Sys
Paper           arXiv:2211.01335
Architecture    ViT-B/16 + RoBERTa-wwm-base
Training Data   200M Chinese image-text pairs

What is chinese-clip-vit-base-patch16?

Chinese CLIP is a multimodal model that maps Chinese text and images into a shared embedding space using contrastive learning. It uses a ViT-B/16 image encoder and a RoBERTa-wwm-base text encoder, and was trained on a dataset of roughly 200 million Chinese image-text pairs.
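To make the contrastive objective concrete, here is a minimal pure-Python sketch of a symmetric cross-entropy loss over an image-text similarity matrix, in the style of the original CLIP. It is an illustration only, not the authors' training code: the function names and the temperature of 0.07 are assumptions chosen for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def contrastive_loss(sim, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    sim[i][j] is the similarity between image i and text j. Matching
    pairs are assumed to sit on the diagonal. The temperature value is
    illustrative, not taken from the model card.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text direction: row i should put its mass on column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy matrices: correctly aligned pairs vs. swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]
print(contrastive_loss(aligned) < contrastive_loss(shuffled))
```

The loss is low when each image is most similar to its own caption and high when the pairs are mismatched, which is the signal that pulls matching image and text embeddings together during training.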

Implementation Details

The model uses a dual-encoder architecture: images and text are encoded separately, and similarity scores are then computed between the resulting embeddings. The authors report state-of-the-art performance on several Chinese vision-language tasks, including zero-shot classification and cross-modal retrieval.
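The similarity step of a dual encoder can be sketched without any model at all. The example below assumes the two encoders have already produced embeddings (the toy 3-dimensional vectors are made up for illustration) and computes cosine similarity between every image and every text; the helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding.

    Returns a list of rows, one per image, with one score per text.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]

# Toy embeddings standing in for encoder outputs.
image_embs = [[3.0, 0.0, 4.0], [0.0, 2.0, 0.0]]
text_embs = [[6.0, 0.0, 8.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

scores = similarity_matrix(image_embs, text_embs)
for row in scores:
    print([round(s, 3) for s in row])
```

Because the encoders run independently, embeddings for a large image collection can be computed once and reused for every new text query.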

  • Utilizes Vision Transformer (ViT) architecture with 16x16 patch size
  • Implements contrastive learning approach similar to OpenAI's CLIP
  • Supports zero-shot classification capabilities
  • Provides both image and text feature extraction
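As a rough sketch of how zero-shot classification might look in practice, the snippet below uses the `ChineseCLIPModel` and `ChineseCLIPProcessor` classes from Hugging Face Transformers. The Hub identifier, the function names, and the example path and labels are assumptions for illustration; the heavy imports sit inside the function so the helper below it can be used without the model installed.

```python
MODEL_ID = "OFA-Sys/chinese-clip-vit-base-patch16"  # assumed Hub identifier

def classify(image_path, labels):
    """Return {label: probability} for one image.

    Requires torch, transformers, and Pillow, and downloads the weights
    on first use. `image_path` and `labels` are supplied by the caller.
    """
    import torch
    from PIL import Image
    from transformers import ChineseCLIPModel, ChineseCLIPProcessor

    model = ChineseCLIPModel.from_pretrained(MODEL_ID)
    processor = ChineseCLIPProcessor.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_labels).
    probs = outputs.logits_per_image.softmax(dim=1)[0].tolist()
    return dict(zip(labels, probs))

def best_label(scores):
    """Pick the label with the highest probability."""
    return max(scores, key=scores.get)

# Hypothetical usage (path and labels are placeholders):
#   scores = classify("photo.jpg", ["猫", "狗", "汽车"])
#   print(best_label(scores))
```

For feature extraction alone, the same model class also exposes `get_image_features` and `get_text_features`, which return the embeddings before any similarity is computed.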

Core Capabilities

  • Text-to-Image Retrieval (71.2% R@1 on Flickr30K-CN zero-shot)
  • Image-to-Text Retrieval (81.6% R@1 on Flickr30K-CN zero-shot)
  • Zero-shot Image Classification (96.0% on CIFAR10)
  • Cross-modal Similarity Computation
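The R@1 figures above are recall-at-k scores. As a small illustration of how such a metric is typically computed from a query-by-candidate similarity matrix, consider the sketch below. The matrix and ground-truth sets are made-up toy values, not benchmark data, and the function name is hypothetical.

```python
def recall_at_k(sim, ground_truth, k=1):
    """Fraction of queries with a correct candidate in the top k.

    sim[q][c] is the score of candidate c for query q.
    ground_truth[q] is the set of correct candidate indices for query q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if any(c in ground_truth[q] for c in ranked[:k]):
            hits += 1
    return hits / len(sim)

# Toy example: 3 text queries scored against 3 candidate images.
sim = [
    [0.9, 0.1, 0.3],  # query 0 ranks its own image first
    [0.2, 0.4, 0.8],  # query 1 ranks the wrong image first
    [0.1, 0.2, 0.7],  # query 2 ranks its own image first
]
truth = [{0}, {1}, {2}]

print(recall_at_k(sim, truth, k=1))
print(recall_at_k(sim, truth, k=2))
```

Here two of the three queries retrieve the right image at rank 1, and all three do so within the top 2.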

Frequently Asked Questions

Q: What makes this model unique?

This model is designed specifically for Chinese-language vision-language tasks. According to its authors, it outperforms previous Chinese multimodal models, notably in zero-shot classification and cross-modal retrieval.

Q: What are the recommended use cases?

The model is well suited to image-text matching, zero-shot image classification, and cross-modal retrieval in Chinese. Typical applications include content recommendation, visual search, and ranking candidate captions for an image. Because it is a dual encoder rather than a generative model, it scores existing captions but does not write new ones.
