CLIP-Registers-Gated_MLP-ViT-L-14

zer0int

Enhanced CLIP ViT-L/14 with register tokens and gated MLPs, reducing the modality gap between text and image representations and improving performance across vision tasks, at a cost of roughly 20M extra parameters over standard CLIP.

| Property | Value |
| --- | --- |
| Parameter Count | ~450M parameters |
| Model Type | Vision-Language Model |
| Architecture | Modified CLIP with Register Tokens |
| Author | zer0int |
| Repository | GitHub Repository |

What is CLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of OpenAI's CLIP model that introduces register tokens and gated ReLU MLPs to significantly reduce the modality gap between text and image representations. The model adds approximately 20M parameters to the original CLIP architecture, bringing the total to around 450M parameters.

Implementation Details

The model implements several key architectural modifications to the standard CLIP ViT-L/14:

  • Addition of 4 register tokens to the Vision Transformer
  • Implementation of gated ReLU MLPs in each layer
  • Introduction of a final Fusion MLP
  • Significant reduction in modality gap metrics (Euclidean Gap: 0.5395, JSD: 0.1303)
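The first two modifications can be sketched in a few lines. This is a minimal NumPy toy, not the repository's implementation: the dimensions are illustrative (the real ViT-L/14 uses d_model=1024 and a 4096-wide MLP), and the "gated ReLU MLP" is assumed to follow the common ReGLU pattern, where a ReLU-activated gate branch elementwise-modulates a plain linear branch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the real model is far larger)
d_model, d_ff, n_patches, n_registers = 8, 16, 6, 4

# --- Register tokens: learned, image-independent vectors that ride along ---
# --- with the CLS and patch tokens through the transformer               ---
cls_tok = rng.standard_normal((1, d_model))
registers = rng.standard_normal((n_registers, d_model))
patches = rng.standard_normal((n_patches, d_model))
tokens = np.concatenate([cls_tok, registers, patches], axis=0)

# --- Gated ReLU MLP (ReGLU-style sketch) ---
def gated_relu_mlp(x, w_gate, w_up, w_down):
    gate = np.maximum(0.0, x @ w_gate)  # ReLU-activated gate branch
    up = x @ w_up                       # plain linear branch
    return (gate * up) @ w_down         # elementwise gating, then down-projection

w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))

y = gated_relu_mlp(tokens, w_gate, w_up, w_down)
print(tokens.shape, y.shape)  # (11, 8) (11, 8)
```

Note that the gated variant carries two up-projection matrices instead of one, which is where much of the extra ~20M parameters comes from.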

Core Capabilities

  • Superior performance on VOC-2007 multilabel classification (mAP: 0.8471)
  • Enhanced MSCOCO retrieval capabilities (Image Recall@5: 0.3532, Text Recall@5: 0.5278)
  • Improved zero-shot performance on MVT ImageNet/ObjectNet (Accuracy: 0.8830)
  • Significantly reduced modality gap compared to standard CLIP
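The exact evaluation protocol behind the Euclidean Gap and JSD numbers above is not specified here, but the commonly used definitions can be sketched as follows: the Euclidean gap is the distance between the centroids of L2-normalized image and text embeddings, and JSD is the Jensen-Shannon divergence between two discrete distributions (how the repository derives those distributions is an assumption left open).

```python
import numpy as np

def euclidean_gap(img_emb, txt_emb):
    """Distance between the centroids of L2-normalized image and text
    embeddings -- a common modality-gap measure (0 means no gap)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (nats)."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Synthetic demo: a shifted text cluster produces a nonzero gap
rng = np.random.default_rng(1)
img = rng.standard_normal((100, 16))
txt = rng.standard_normal((100, 16)) + 0.5
gap = euclidean_gap(img, txt)
print(gap > 0.0, euclidean_gap(img, img) == 0.0)
```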

Frequently Asked Questions

Q: What makes this model unique?

The model's uniqueness lies in its register token architecture and gated MLPs, which significantly reduce the modality gap while maintaining or improving performance across various vision tasks. The balanced checkpoint (ckpt12) offers an optimal trade-off between accuracy and modality alignment.

Q: What are the recommended use cases?

This model is particularly well-suited for text-to-image and text-to-video generation pipelines, and for any task requiring strong alignment between text and image modalities. It is especially effective for zero-shot classification and image-text retrieval.
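Since the model is a CLIP variant, zero-shot classification follows the standard CLIP recipe: embed one text prompt per class, compare against the image embedding by cosine similarity, and softmax. The sketch below assumes precomputed embeddings rather than calling the actual model; the temperature value is illustrative.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between the
    image embedding and one text embedding per class, then softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)     # scaled cosine similarities
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Synthetic demo: stand-ins for "a photo of a {cat, dog, car}" embeddings
rng = np.random.default_rng(2)
class_texts = rng.standard_normal((3, 16))
image = class_texts[1] + 0.01 * rng.standard_normal(16)  # near class 1
probs = zero_shot_classify(image, class_texts)
print(int(np.argmax(probs)))  # 1
```

Image-text retrieval uses the same similarity matrix, ranking candidates by cosine score instead of softmaxing over classes.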
