LongCLIP-Registers-Gated_MLP-ViT-L-14

zer0int

Enhanced Long-CLIP model with a 248-token text input capacity, featuring register tokens and gated MLPs. It significantly reduces the modality gap and improves retrieval performance.

Property	Value
Author	zer0int
Token Capacity	248 tokens
Model Type	CLIP Text-Image Encoder
Base Architecture	ViT-L/14 with Register Tokens
Model URL	huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14

What is LongCLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of the LongCLIP model that introduces register tokens and gated MLPs to improve performance and reduce the modality gap between text and image representations. The model extends CLIP's token limit from 77 to 248 tokens while significantly improving retrieval performance and maintaining strong classification capabilities.
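
To make the longer context concrete, here is a minimal sketch of fitting a tokenized prompt into a fixed context window. The function name `fit_to_context`, the pad id `0`, and the example token ids are hypothetical and for illustration only. It also ignores start-of-text and end-of-text tokens, which a real CLIP tokenizer adds and preserves when truncating.

```python
def fit_to_context(token_ids, context_length=248, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly context_length."""
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    clipped = token_ids[:context_length]
    return clipped + [pad_id] * (context_length - len(clipped))


# A hypothetical 120-token prompt: too long for a 77-token window,
# but it fits in a 248-token window without truncation.
prompt_ids = list(range(1, 121))

short_window = fit_to_context(prompt_ids, context_length=77)
long_window = fit_to_context(prompt_ids, context_length=248)

# Count how many prompt tokens were cut off in each case (pad id is 0).
tokens_lost_at_77 = len(prompt_ids) - sum(1 for t in short_window if t != 0)
tokens_lost_at_248 = len(prompt_ids) - sum(1 for t in long_window if t != 0)
```

In this toy case the 77-token window drops the last 43 tokens of the prompt, while the 248-token window keeps all 120 and pads the rest.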

Implementation Details

The model implements several key architectural improvements over the original LongCLIP-L, including register tokens and gated MLPs in the ViT architecture. It reduces the modality gap to 0.5781, compared with 1.0672 for the original LongCLIP-L, and improves cross-modal retrieval performance.

Increased token limit to 248 tokens
Enhanced ViT architecture with register tokens
Improved modality alignment through gated MLPs
Compatible with standard CLIP interfaces
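
The two architectural ideas in the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the model's actual implementation: the page does not say how many register tokens are used or which gating variant the MLPs follow. The default of 4 registers, the 224-pixel input size, the GLU-style sigmoid gate, and all weights below are hypothetical.

```python
import math


def vit_sequence_length(image_size=224, patch_size=14, num_registers=4):
    """Tokens per image: one class token, the patch tokens, and the registers."""
    patches_per_side = image_size // patch_size
    return 1 + patches_per_side ** 2 + num_registers


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_mlp(x, w_up, w_gate, w_down):
    """GLU-style block: a sigmoid gate scales each hidden unit
    before the down projection (assumed variant)."""
    hidden = [gelu(h) for h in matvec(w_up, x)]
    gate = [sigmoid(g) for g in matvec(w_gate, x)]
    return matvec(w_down, [h * g for h, g in zip(hidden, gate)])


# Toy weights (hypothetical): 2 input features, 3 hidden units, 2 outputs.
x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_gate = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
w_down = [[1.0, 1.0, 1.0], [1.0, -1.0, 0.0]]

y = gated_mlp(x, w_up, w_gate, w_down)
seq_len = vit_sequence_length()
```

With the assumed defaults, the image sequence has 1 + 256 + 4 = 261 tokens. The register tokens take part in attention like any other token, and in the usual formulation they are discarded at the output rather than used as features.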

Core Capabilities

Superior MSCOCO Image Retrieval (Recall@5: 0.3663)
Enhanced Text Retrieval Performance (Recall@5: 0.5398)
Strong ImageNet/ObjectNet Zero-Shot Performance (MVT: 0.8724)
Reduced Modality Gap (0.5781)
Improved Image-Text Cosine Similarity (Mean: 0.4711)
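
For readers who want to reproduce numbers of this kind, the following is a minimal sketch of how such metrics are commonly computed from paired image and text embeddings. The exact evaluation protocol behind the figures above is not given on this page, so the definitions here are assumptions: Recall@K with one matching item per query (MSCOCO actually has several captions per image), and the modality gap as the Euclidean distance between the centroids of the normalized embeddings. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def recall_at_k(query_embs, gallery_embs, k=5):
    """Fraction of queries whose matching gallery item (same index)
    appears among the k most similar gallery items."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, g) for g in gallery_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(query_embs)


def modality_gap(image_embs, text_embs):
    """Euclidean distance between the centroids of the normalized
    image embeddings and the normalized text embeddings."""
    def centroid(embs):
        unit = [normalize(e) for e in embs]
        return [sum(col) / len(unit) for col in zip(*unit)]
    return math.dist(centroid(image_embs), centroid(text_embs))


def mean_paired_cosine(image_embs, text_embs):
    """Average cosine similarity over matching image-text pairs."""
    pairs = list(zip(image_embs, text_embs))
    return sum(cosine(i, t) for i, t in pairs) / len(pairs)


# Toy 2-D embeddings (hypothetical); row i of each list is a matching pair.
image_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text_embs = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.2]]

text_to_image_r1 = recall_at_k(text_embs, image_embs, k=1)
gap = modality_gap(image_embs, text_embs)
mean_cos = mean_paired_cosine(image_embs, text_embs)
```

Under these definitions, a smaller gap means the image and text embeddings occupy closer regions of the shared space, and a higher mean paired cosine means matching pairs point in more similar directions.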

Frequently Asked Questions

Q: What makes this model unique?

This model combines an extended 248-token capacity with register tokens and gated MLPs in the ViT. Together these changes yield significantly better modality alignment and retrieval performance while maintaining classification accuracy.

Q: What are the recommended use cases?

The model is particularly well suited to text-to-image generation, video processing, and other applications that require longer text inputs. It is designed as a drop-in replacement for CLIP-L in systems such as ComfyUI.