CLIP-Registers-Gated_MLP-ViT-L-14

zer0int

Enhanced CLIP ViT-L/14 with register tokens and gated MLPs, reducing the modality gap between text and image representations and improving performance across vision tasks, at a cost of roughly 20M extra parameters over standard CLIP.

| Property | Value |
| --- | --- |
| Parameter Count | ~450M parameters |
| Model Type | Vision-Language Model |
| Architecture | Modified CLIP with Register Tokens |
| Author | zer0int |
| Repository | GitHub Repository |

What is CLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of OpenAI's CLIP model that introduces register tokens and gated ReLU MLPs to significantly reduce the modality gap between text and image representations. The model adds approximately 20M parameters to the original CLIP architecture, bringing the total to around 450M parameters.

Implementation Details

The model implements several key architectural modifications to the standard CLIP ViT-L/14:

  • Addition of 4 register tokens to the Vision Transformer
  • Implementation of gated ReLU MLPs in each layer
  • Introduction of a final Fusion MLP
  • Significant reduction in modality gap metrics (Euclidean Gap: 0.5395, JSD: 0.1303)
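The first two modifications can be sketched in a few lines. This is a minimal NumPy toy, not the repository's implementation: the dimensions are illustrative (the real ViT-L/14 uses d_model=1024 and a 4096-wide MLP), and the "gated ReLU MLP" is assumed to follow the common ReGLU pattern, where a ReLU-activated gate branch elementwise-modulates a plain linear branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the real model is far larger)
d_model, d_ff, n_patches, n_registers = 8, 16, 6, 4

# --- Register tokens: learned, image-independent vectors that ride along ---
# --- with the CLS and patch tokens through the transformer               ---
cls_tok = rng.standard_normal((1, d_model))
registers = rng.standard_normal((n_registers, d_model))
patches = rng.standard_normal((n_patches, d_model))
tokens = np.concatenate([cls_tok, registers, patches], axis=0)

# --- Gated ReLU MLP (ReGLU-style sketch) ---
def gated_relu_mlp(x, w_gate, w_up, w_down):
    gate = np.maximum(0.0, x @ w_gate)  # ReLU-activated gate branch
    up = x @ w_up                       # plain linear branch
    return (gate * up) @ w_down         # elementwise gating, then down-projection

w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

y = gated_relu_mlp(tokens, w_gate, w_up, w_down)
print(tokens.shape, y.shape)  # (11, 8) (11, 8)
```

Note that the gated variant carries two up-projection matrices instead of one, which is where much of the extra ~20M parameters comes from.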

Core Capabilities

  • Superior performance on VOC-2007 multilabel classification (mAP: 0.8471)
  • Enhanced MSCOCO retrieval capabilities (Image Recall@5: 0.3532, Text Recall@5: 0.5278)
  • Improved zero-shot performance on MVT ImageNet/ObjectNet (Accuracy: 0.8830)
  • Significantly reduced modality gap compared to standard CLIP
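The exact evaluation protocol behind the Euclidean Gap and JSD numbers above is not specified here, but the commonly used definitions can be sketched as follows: the Euclidean gap is the distance between the centroids of L2-normalized image and text embeddings, and JSD is the Jensen-Shannon divergence between two discrete distributions (how the repository derives those distributions is an assumption left open).

```python
import numpy as np

def euclidean_gap(img_emb, txt_emb):
    """Distance between the centroids of L2-normalized image and text
    embeddings -- a common modality-gap measure (0 means no gap)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Synthetic demo: a shifted text cluster produces a nonzero gap
rng = np.random.default_rng(1)
img = rng.standard_normal((100, 16))
txt = rng.standard_normal((100, 16)) + 0.5
gap = euclidean_gap(img, txt)
print(gap > 0.0, euclidean_gap(img, img) == 0.0)
```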

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its register token architecture and gated MLPs, which significantly reduce the modality gap while maintaining or improving performance across various vision tasks. The balanced checkpoint (ckpt12) offers an optimal trade-off between accuracy and modality alignment.

Q: What are the recommended use cases?

This model is particularly well-suited for text-to-image and text-to-video generation pipelines, and for any task requiring strong alignment between text and image modalities. It is especially effective for zero-shot classification and image-text retrieval.
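Since the model is a CLIP variant, zero-shot classification follows the standard CLIP recipe: embed one text prompt per class, compare against the image embedding by cosine similarity, and softmax. The sketch below assumes precomputed embeddings rather than calling the actual model; the temperature value is illustrative.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between the
    image embedding and one text embedding per class, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)     # scaled cosine similarities
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Synthetic demo: stand-ins for "a photo of a {cat, dog, car}" embeddings
rng = np.random.default_rng(2)
class_texts = rng.standard_normal((3, 16))
image = class_texts[1] + 0.01 * rng.standard_normal(16)  # near class 1
probs = zero_shot_classify(image, class_texts)
print(int(np.argmax(probs)))  # 1
```

Image-text retrieval uses the same similarity matrix, ranking candidates by cosine score instead of softmaxing over classes.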
