LongCLIP-Registers-Gated_MLP-ViT-L-14

zer0int

Enhanced Long-CLIP model with a 248-token input capacity, featuring register tokens and gated MLPs. It reduces the modality gap and improves retrieval performance relative to the original Long-CLIP-L.

Property            Value
Author              zer0int
Token Capacity      248 tokens
Model Type          CLIP Text-Image Encoder
Base Architecture   ViT-L/14 with Register Tokens
Model URL           huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14

What is LongCLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of the LongCLIP model that introduces register tokens and gated MLPs to improve performance and reduce the modality gap between text and image representations. The model extends CLIP's token limit from 77 to 248 tokens while significantly improving retrieval performance and maintaining strong classification capabilities.

Implementation Details

The model makes several architectural changes to the original LongCLIP-L, adding register tokens and gated MLPs to the ViT architecture. The modality gap drops from 1.0672 to 0.5781 (lower is better), and cross-modal retrieval performance improves as well.

  • Increased token limit to 248 tokens
  • Enhanced ViT architecture with register tokens
  • Improved modality alignment through gated MLPs
  • Compatible with standard CLIP interfaces
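The modality gap figures quoted above can be made concrete with a small sketch. The page does not say how the gap was measured, so this assumes one common definition: the Euclidean distance between the centroid of the L2-normalized image embeddings and the centroid of the L2-normalized text embeddings. The function names and the toy vectors are hypothetical and only illustrate the arithmetic.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def modality_gap(image_embeds, text_embeds):
    """Distance between the image and text centroids (assumed definition).

    Both inputs are lists of embedding vectors; each vector is
    L2-normalized before the centroids are computed.
    """
    image_centroid = centroid([l2_normalize(v) for v in image_embeds])
    text_centroid = centroid([l2_normalize(v) for v in text_embeds])
    return math.dist(image_centroid, text_centroid)

# Made-up 2-D embeddings, not outputs of the model.
toy_image_embeds = [[1.0, 0.0], [1.0, 0.0]]
toy_text_embeds = [[0.0, 1.0], [0.0, 2.0]]
toy_gap = modality_gap(toy_image_embeds, toy_text_embeds)
```

Under this definition a smaller value means the image and text embeddings occupy closer regions of the shared space, which matches the page's reading that a lower gap is better.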

Core Capabilities

  • Superior MSCOCO Image Retrieval (Recall@5: 0.3663)
  • Enhanced Text Retrieval Performance (Recall@5: 0.5398)
  • Strong ImageNet/ObjectNet Zero-Shot Performance (MVT: 0.8724)
  • Reduced Modality Gap (0.5781)
  • Improved Image-Text Cosine Similarity (Mean: 0.4711)
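The Recall@5 numbers in the list above are retrieval metrics. As a rough illustration, the sketch below assumes the usual retrieval convention: a query counts as a hit when at least one correct candidate appears among its K highest-scoring candidates, and Recall@K is the fraction of queries that hit. The page does not describe the evaluation script, so the function name, the input layout, and the toy scores are assumptions.

```python
def recall_at_k(similarity, ground_truth, k=5):
    """Fraction of queries with a correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q
    (for example, a text-image cosine similarity).
    ground_truth[q] is the set of candidate indices correct for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return hits / len(similarity)

# Made-up scores for 3 queries over 3 candidates, not model outputs.
toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.2, 0.1, 0.8],
    [0.3, 0.5, 0.4],
]
toy_ground_truth = [{0}, {1}, {2}]
toy_recall_at_2 = recall_at_k(toy_similarity, toy_ground_truth, k=2)
```

For text-to-image retrieval the queries are captions and the candidates are images; for image-to-text retrieval the roles swap, and a query image may have several correct captions, which is why the ground truth is a set.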

Frequently Asked Questions

Q: What makes this model unique?

This model uniquely combines extended token capacity with architectural improvements like register tokens and gated MLPs, resulting in significantly better modality alignment and retrieval performance while maintaining classification accuracy.

Q: What are the recommended use cases?

The model is particularly well-suited for text-to-image generation, video processing, and applications requiring longer text inputs. It's designed as a drop-in replacement for CLIP-L in systems like ComfyUI.
