clip-variants

mlunar

OpenAI CLIP model variants converted to ONNX format, covering ResNet and ViT architectures in float32, float16, qint8, and quint8 precision.

Property	Value
License	MIT
Format	ONNX
Supported Architectures	ResNet-50/101, ViT-B/16, ViT-B/32, ViT-L/14
Precision Types	float32, float16, qint8, quint8

What is clip-variants?

clip-variants is a collection of OpenAI's CLIP models converted to ONNX format, covering several architectures and precision types. The collection handles both visual and textual processing, which makes it suitable for multimodal tasks.

Implementation Details

The repository contains ONNX conversions of all available OpenAI CLIP models, each split into two separate modes: visual and textual. Each variant is published in several precision types to fit different performance and size requirements.

Supports both ResNet and Vision Transformer (ViT) architectures
Includes model sizes ranging from ResNet-50 to ViT-L/14
Offers float32, float16, qint8, and quint8 precision for flexibility in deployment
Distributes every variant in ONNX format for use with ONNX-compatible runtimes
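
As a rough illustration of how one of these ONNX files might be used, the sketch below normalizes pixels with CLIP's published mean and standard deviation and runs a visual model through ONNX Runtime. The model path, the `(N, 3, H, W)` input layout, and the helper names are assumptions made for this example rather than details documented here; the input name is read from the session instead of being guessed.

```python
# CLIP's published per-channel normalization constants (RGB order).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the normalized floats CLIP expects."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, CLIP_MEAN, CLIP_STD)
    )


def run_visual_encoder(model_path, batch):
    """Run a visual ONNX file on an already preprocessed batch.

    Assumptions: `model_path` points to one of the visual files, and
    `batch` is a NumPy array shaped (N, 3, H, W) in the dtype the file
    expects. Verify both against session.get_inputs() before relying on it.
    """
    # Imported lazily so the pure-Python helper above has no dependencies.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    model_input = session.get_inputs()[0]  # exposes .name, .shape and .type
    outputs = session.run(None, {model_input.name: batch})
    return outputs[0]  # the first output is assumed to hold the embeddings


# A white pixel after normalization, one value per channel.
print(normalize_pixel((255, 255, 255)))
```

Inspecting `model_input.shape` and `model_input.type` on the loaded session is the safest way to confirm the expected resolution and dtype for a given variant, since these may differ between precision types.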

Core Capabilities

Zero-shot image classification
Visual-textual alignment
Multimodal feature extraction
Flexible deployment options with different precision types
Support for both CNN and Transformer architectures
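
To make zero-shot classification concrete, here is a minimal sketch of the scoring step, assuming the image and text embeddings have already been produced by the visual and textual models. The toy three-dimensional vectors, the prompts, and the function names are illustrative only, and the logit scale of 100 is an assumption based on the value commonly used with the released CLIP checkpoints.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and each prompt."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(text)))
        for text in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for real embeddings (real ones have hundreds of dimensions).
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = zero_shot_probs(image_embedding, text_embeddings)
best = labels[probs.index(max(probs))]
print(best)  # a photo of a cat
```

No classifier is trained here: the candidate labels are only text prompts, so the label set can change without touching the models.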

Frequently Asked Questions

Q: What makes this collection unique?

This collection provides ONNX-converted variants of CLIP, which makes the models easier to deploy across environments, and it offers several precision options for balancing performance against resource usage.
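
The resource side of that trade-off can be estimated from the bytes each precision uses per stored weight. The sketch below is a back-of-the-envelope calculation only: the parameter count is a hypothetical placeholder, not a figure for any model in the collection, and real quantized ONNX files may keep some tensors in floating point, so actual file sizes can be larger.

```python
# Bytes used to store one weight at each precision type.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "qint8": 1, "quint8": 1}


def estimated_weight_megabytes(num_parameters, precision):
    """Approximate weight storage in MB (10**6 bytes), ignoring file overhead."""
    return num_parameters * BYTES_PER_WEIGHT[precision] / 1_000_000


# Hypothetical parameter count chosen only to make the arithmetic easy to follow.
HYPOTHETICAL_PARAMETERS = 100_000_000

sizes = {
    precision: estimated_weight_megabytes(HYPOTHETICAL_PARAMETERS, precision)
    for precision in BYTES_PER_WEIGHT
}
for precision, megabytes in sizes.items():
    print(f"{precision}: {megabytes:.0f} MB")
```

Under these assumptions, float16 halves the float32 footprint and the 8-bit types quarter it; whether the accuracy cost is acceptable has to be measured per task.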

Q: What are the recommended use cases?

The models are suitable for zero-shot image classification, visual-textual alignment tasks, and general multimodal applications that require image and text understanding. Evaluate the chosen architecture and precision type on data from your own deployment context before relying on it.
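
One simple form such an evaluation could take is to compare the labels predicted by a reduced-precision variant with those from the float32 variant on the same images. The sketch below assumes the predictions are already available as lists; the function name and the example labels are made up for illustration.

```python
def agreement_rate(reference_labels, candidate_labels):
    """Fraction of inputs on which two variants predict the same label."""
    if len(reference_labels) != len(candidate_labels):
        raise ValueError("both variants must be evaluated on the same inputs")
    if not reference_labels:
        raise ValueError("at least one prediction is required")
    matches = sum(
        ref == cand for ref, cand in zip(reference_labels, candidate_labels)
    )
    return matches / len(reference_labels)


# Made-up predictions for four images from two precision types.
float32_predictions = ["cat", "dog", "car", "cat"]
quantized_predictions = ["cat", "dog", "cat", "cat"]

rate = agreement_rate(float32_predictions, quantized_predictions)
print(rate)  # 0.75
```

Agreement with the float32 variant is only a proxy; where ground-truth labels exist, measuring accuracy against them directly is more informative.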