CLIP-Registers-Gated_MLP-ViT-L-14

Maintained By: zer0int

Property          Value
Parameter Count   ~450M parameters
Model Type        Vision-Language Model
Architecture      Modified CLIP with Register Tokens
Author            zer0int
Repository        GitHub Repository

What is CLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of OpenAI's CLIP model that introduces register tokens and gated ReLU MLPs to significantly reduce the modality gap between text and image representations. The model adds approximately 20M parameters to the original CLIP architecture, bringing the total to around 450M parameters.
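To make "modality gap" concrete, here is a minimal sketch of one common way to measure it: the Euclidean distance between the centroids of L2-normalized image embeddings and L2-normalized text embeddings. Whether the figures reported on this page were computed exactly this way is an assumption, and the function names and toy vectors below are purely illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def euclidean_gap(image_embs, text_embs):
    """Distance between the centroids of the two normalized embedding sets."""
    img_c = centroid([l2_normalize(v) for v in image_embs])
    txt_c = centroid([l2_normalize(v) for v in text_embs])
    return math.dist(img_c, txt_c)

# Toy embeddings (placeholders, not real CLIP outputs).
image_embs = [[1.0, 0.0], [2.0, 0.0]]
text_embs = [[0.0, 1.0], [0.0, 3.0]]
gap = euclidean_gap(image_embs, text_embs)
print(f"Euclidean gap: {gap:.4f}")
```

A smaller value means the image and text embeddings occupy closer regions of the shared space.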

Implementation Details

The model implements several key architectural modifications to the standard CLIP ViT-L/14:

  • Addition of 4 register tokens to the Vision Transformer
  • Implementation of gated ReLU MLPs in each layer
  • Introduction of a final Fusion MLP
  • Result: a reduced modality gap (Euclidean Gap: 0.5395, JSD (Jensen-Shannon divergence): 0.1303)
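The first two modifications in the list above can be illustrated with a small, dependency-free sketch. It assumes the gated ReLU MLP follows the familiar ReGLU form, `W_down · (ReLU(W_gate · x) ⊙ (W_up · x))`, and that register tokens are extra learned tokens placed after the class token and discarded at the output. Both are assumptions about this model, not confirmed details; the weight names and the token ordering are hypothetical, and a real implementation would use a tensor library. The Fusion MLP is not sketched because its structure is not described here.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_relu_mlp(x, W_gate, W_up, W_down):
    """ReGLU-style MLP: the ReLU branch gates the linear branch element-wise."""
    gate = relu(matvec(W_gate, x))
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def add_registers(cls_token, patch_tokens, register_tokens):
    """Build the input sequence: class token, registers, then patches (assumed order)."""
    return [cls_token] + register_tokens + patch_tokens

def strip_registers(tokens, num_registers):
    """Drop the register tokens again after the transformer layers."""
    return [tokens[0]] + tokens[1 + num_registers:]

# Toy dimensions: model width 2, hidden width 3 (hypothetical values).
W_gate = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_up = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = gated_relu_mlp([1.0, -2.0], W_gate, W_up, W_down)

cls_token = [0.0, 0.0]
patches = [[1.0, 1.0], [2.0, 2.0]]
registers = [[0.5, 0.5] for _ in range(4)]  # 4 register tokens, as in the list above
tokens = add_registers(cls_token, patches, registers)
stripped = strip_registers(tokens, num_registers=4)
```

In this sketch the registers take part in attention like any other token but are dropped before the final representation is read, so they act only as scratch space.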

Core Capabilities

  • Superior performance on VOC-2007 multilabel classification (mAP: 0.8471)
  • Enhanced MSCOCO retrieval capabilities (Image Recall@5: 0.3532, Text Recall@5: 0.5278)
  • Improved zero-shot performance on MVT ImageNet/ObjectNet (Accuracy: 0.8830)
  • Significantly reduced modality gap compared to standard CLIP
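For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of Recall@K computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, which simplifies the MSCOCO setup (where each image has several captions); the function name and the toy scores are illustrative only.

```python
def recall_at_k(similarity, ground_truth, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j;
    ground_truth[i] is the index of the correct candidate for query i.
    """
    hits = 0
    for query_idx, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if ground_truth[query_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy similarity matrix: 3 queries against 3 candidates.
similarity = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.6],
]
ground_truth = [0, 1, 2]
r1 = recall_at_k(similarity, ground_truth, k=1)
r2 = recall_at_k(similarity, ground_truth, k=2)
```

Image Recall@5 and Text Recall@5 apply the same idea in the two retrieval directions: text queries ranked against images, and image queries ranked against texts.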

Frequently Asked Questions

Q: What makes this model unique?

The model is distinguished by its register tokens and gated MLPs, which significantly reduce the modality gap while maintaining or improving performance across various vision tasks. The balanced checkpoint (ckpt12) offers a trade-off between accuracy and modality alignment.

Q: What are the recommended use cases?

This model is particularly well-suited for text-to-image applications, text-to-video AI, and any tasks requiring strong alignment between text and image modalities. It's especially effective for zero-shot classification and image-text retrieval tasks.
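As a rough illustration of CLIP-style zero-shot classification, the sketch below scores one image embedding against a set of label-prompt embeddings using cosine similarity, then applies a softmax. The embeddings are hand-written placeholders rather than outputs of this model, the function name is hypothetical, and the `logit_scale` of 100 is an assumed temperature (a value commonly associated with CLIP models), not a setting confirmed for this checkpoint.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return the best-matching label and the softmax probabilities."""
    logits = [logit_scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Placeholder embeddings standing in for encoder outputs.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
image_emb = [0.9, 0.1, 0.0]
label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(label)
```

In practice the text embeddings would come from encoding one prompt per class with the text encoder, and the image embedding from the vision encoder.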
