LongCLIP-GmP-ViT-L-14

Maintained by: zer0int

Property         Value
Parameter Count  428M
Model Type       CLIP
Architecture     Vision Transformer Large/14
License          MIT
Tensor Type      F32

What is LongCLIP-GmP-ViT-L-14?

LongCLIP-GmP-ViT-L-14 is a fine-tune of the original Long-CLIP model that incorporates Geometric Parametrization (GmP). It extends CLIP's text input from the standard 77 tokens to sequences of up to 248 tokens, while reaching an improved ImageNet/ObjectNet accuracy of 0.89.
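As a toy illustration of why the longer context matters (the helper and token ids below are hypothetical, not the model's actual tokenizer):

```python
def pad_or_truncate(token_ids, context_length):
    """Truncate a token-id sequence to the context length, then right-pad with zeros."""
    ids = token_ids[:context_length]
    return ids + [0] * (context_length - len(ids))

# A stand-in id sequence for a long, detailed caption (~150 tokens).
long_caption_ids = list(range(1, 151))

clip_view = pad_or_truncate(long_caption_ids, 77)       # standard CLIP: tail is cut off
longclip_view = pad_or_truncate(long_caption_ids, 248)  # Long-CLIP: full caption fits
```

With a 77-token window, roughly half of this caption is silently discarded; the 248-token window keeps it intact and pads the remainder.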

Implementation Details

Geometric parametrization decomposes each weight vector into a radial component (its magnitude) and an angular component (its direction), so both are preserved explicitly during training. This decomposition has proven particularly effective for maintaining model stability during fine-tuning.

  • Supports 248 token sequences
  • Implements Geometric Linear layers
  • Compatible with Flux.1, SDXL, and Stable Diffusion
  • Includes custom loss with label smoothing
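A minimal NumPy sketch of the decomposition idea described above (illustrative only; the model itself applies this per weight vector inside its Geometric Linear layers):

```python
import numpy as np

def gmp_decompose(w):
    """Split a weight vector into radial (magnitude) and angular (unit-direction) parts."""
    r = np.linalg.norm(w)
    theta = w / r
    return r, theta

def gmp_recompose(r, theta):
    """Rebuild the original weight vector from its two components."""
    return r * theta

w = np.array([3.0, 4.0])
r, theta = gmp_decompose(w)          # r is the magnitude, theta a unit vector
w_rebuilt = gmp_recompose(r, theta)  # recovers the original weights exactly
```

Because magnitude and direction live in separate parameters, an optimizer can adjust one without disturbing the other, which is the stability argument made above.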

Core Capabilities

  • Enhanced zero-shot image classification
  • Improved text-image matching accuracy
  • Superior performance on longer text sequences
  • Better cosine similarities for image-text pairs
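The cosine-similarity scoring behind zero-shot classification can be sketched as follows (the embeddings here are hypothetical stand-ins; in practice they come from the model's image and text encoders):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """CLIP-style scoring: normalize, take cosine similarities, scale, softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

image_emb = np.array([0.9, 0.1, 0.0])    # stand-in image embedding
text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
                      [0.0, 1.0, 0.0]])  # e.g. "a photo of a dog"
probs = zero_shot_probs(image_emb, text_embs)
```

The caption whose embedding points closest to the image embedding receives the highest probability; longer, more specific captions simply produce text embeddings from a richer input.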

Frequently Asked Questions

Q: What makes this model unique?

The model's unique feature is its combination of extended token length support (248 tokens) with geometric parametrization, resulting in significantly improved accuracy while maintaining stability in fine-tuning scenarios.

Q: What are the recommended use cases?

The model is particularly well-suited for image-text matching tasks that require longer text descriptions, for zero-shot image classification, and as a text encoder for text-to-image diffusion models such as Stable Diffusion, SDXL, and Flux.1.
