LongCLIP-Registers-Gated_MLP-ViT-L-14

Maintained By
zer0int

Property            Value
Author              zer0int
Token Capacity      248 tokens
Model Type          CLIP Text-Image Encoder
Base Architecture   ViT-L/14 with Register Tokens
Model URL           huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14

What is LongCLIP-Registers-Gated_MLP-ViT-L-14?

This is an enhanced version of the LongCLIP model that introduces register tokens and gated MLPs to improve performance and reduce the modality gap between text and image representations. The model extends CLIP's token limit from 77 to 248 tokens while significantly improving retrieval performance and maintaining strong classification capabilities.
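To make the difference between the two token limits concrete, here is a minimal sketch of fitting a token sequence to a fixed context length. The helper `fit_to_context` and the padding ID `0` are hypothetical, chosen only for illustration; a real CLIP tokenizer also adds start and end tokens, which this sketch leaves out.

```python
CLIP_CONTEXT = 77       # token limit of the original CLIP text encoder
LONGCLIP_CONTEXT = 248  # token limit stated for this model


def fit_to_context(token_ids, context_length, pad_id=0):
    """Truncate or right-pad a list of token IDs to exactly context_length.

    pad_id=0 is an assumption for illustration, not the model's real pad token.
    """
    clipped = token_ids[:context_length]
    return clipped + [pad_id] * (context_length - len(clipped))


# Hypothetical 120-token prompt: cut off at 77 tokens, kept whole at 248.
prompt_ids = list(range(1, 121))
short = fit_to_context(prompt_ids, CLIP_CONTEXT)
long = fit_to_context(prompt_ids, LONGCLIP_CONTEXT)
```

With a 77-token limit the last 43 tokens of this prompt are lost, while the 248-token limit keeps the full prompt and pads the remainder.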

Implementation Details

The model adds two architectural changes to the ViT of the original LongCLIP-L: register tokens and gated MLPs. It reduces the modality gap to 0.5781, compared with 1.0672 for the original LongCLIP-L, and improves cross-modal retrieval performance.

  • Increased token limit to 248 tokens
  • Enhanced ViT architecture with register tokens
  • Improved modality alignment through gated MLPs
  • Compatible with standard CLIP interfaces
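The two architectural ideas listed above can be sketched in plain Python. This is an illustration under stated assumptions, not the model's actual implementation: the page does not describe the exact gating formula or where the register tokens sit in the sequence. The sketch assumes a GLU-style gate with GELU activation and registers appended after the patch tokens; all function names are hypothetical.

```python
import math


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_mlp(x, w_gate, w_up, w_down):
    """GLU-style gated MLP (assumed form): w_down @ (gelu(w_gate @ x) * (w_up @ x))."""
    gate = [gelu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def add_registers(patch_tokens, register_tokens):
    """Append extra learnable tokens to the sequence (position is an assumption)."""
    return patch_tokens + register_tokens


def drop_registers(tokens, num_registers):
    """Discard the register tokens again after the transformer blocks."""
    return tokens[:len(tokens) - num_registers]
```

The gate lets one projection of the input scale another element by element, so the network can suppress or pass individual hidden features. The register tokens take part in attention like any other token but are dropped before the output is used.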

Core Capabilities

  • Superior MSCOCO Image Retrieval (Recall@5: 0.3663)
  • Enhanced Text Retrieval Performance (Recall@5: 0.5398)
  • Strong ImageNet/ObjectNet Zero-Shot Performance (MVT: 0.8724)
  • Reduced Modality Gap (0.5781)
  • Improved Image-Text Cosine Similarity (Mean: 0.4711)
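For readers unfamiliar with the metrics above, the following sketch shows one common way to compute Recall@K and a modality gap from embedding vectors. The page does not state which definitions were used for the reported numbers, so both are assumptions: Recall@K pairs each query with the gallery item at the same index, and the modality gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings. The function names are hypothetical, and zero-length vectors are not handled.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def recall_at_k(query_embs, gallery_embs, k):
    """Fraction of queries whose match (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(query_embs):
        scores = [cosine(query, g) for g in gallery_embs]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(query_embs)


def modality_gap(image_embs, text_embs):
    """Distance between the centroids of the normalized embeddings."""
    def centroid(embs):
        unit = [normalize(e) for e in embs]
        return [sum(col) / len(unit) for col in zip(*unit)]
    return math.dist(centroid(image_embs), centroid(text_embs))
```

Real retrieval benchmarks such as MSCOCO pair each image with several captions, so the one-to-one index matching here is a simplification.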

Frequently Asked Questions

Q: What makes this model unique?

This model uniquely combines extended token capacity with architectural improvements like register tokens and gated MLPs, resulting in significantly better modality alignment and retrieval performance while maintaining classification accuracy.

Q: What are the recommended use cases?

The model is particularly well-suited for text-to-image generation, video processing, and applications requiring longer text inputs. It's designed as a drop-in replacement for CLIP-L in systems like ComfyUI.
