LongCLIP-GmP-ViT-L-14

Maintained by: zer0int

Property         Value
Parameter Count  428M
Model Type       CLIP
Architecture     Vision Transformer Large/14
License          MIT
Tensor Type      F32

What is LongCLIP-GmP-ViT-L-14?

LongCLIP-GmP-ViT-L-14 is a fine-tune of the original Long-CLIP model that incorporates Geometric Parametrization (GmP). It extends CLIP's text input from the standard 77 tokens to sequences of up to 248 tokens, while reaching an improved ImageNet/ObjectNet accuracy of 0.89.
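As a toy illustration of why the longer context matters (the helper and token ids below are hypothetical, not the model's actual tokenizer):

```python
def pad_or_truncate(token_ids, context_length):
    """Truncate a token-id sequence to the context length, then right-pad with zeros."""
    ids = token_ids[:context_length]
    return ids + [0] * (context_length - len(ids))

# A stand-in id sequence for a long, detailed caption (~150 tokens).
long_caption_ids = list(range(1, 151))

clip_view = pad_or_truncate(long_caption_ids, 77)       # standard CLIP: tail is cut off
longclip_view = pad_or_truncate(long_caption_ids, 248)  # Long-CLIP: full caption fits
```

With a 77-token window, roughly half of this caption is silently discarded; the 248-token window keeps it intact and pads the remainder.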

Implementation Details

Geometric parametrization decomposes each weight vector into a radial component (its magnitude) and an angular component (its direction), so both are preserved explicitly during training. This decomposition has proven particularly effective for maintaining model stability during fine-tuning.

  • Supports 248 token sequences
  • Implements Geometric Linear layers
  • Compatible with Flux.1, SDXL, and Stable Diffusion
  • Includes custom loss with label smoothing
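A minimal NumPy sketch of the decomposition idea described above (illustrative only; the model itself applies this per weight vector inside its Geometric Linear layers):

```python
import numpy as np

def gmp_decompose(w):
    """Split a weight vector into radial (magnitude) and angular (unit-direction) parts."""
    r = np.linalg.norm(w)
    theta = w / r
    return r, theta

def gmp_recompose(r, theta):
    """Rebuild the original weight vector from its two components."""
    return r * theta

w = np.array([3.0, 4.0])
r, theta = gmp_decompose(w)          # r is the magnitude, theta a unit vector
w_rebuilt = gmp_recompose(r, theta)  # recovers the original weights exactly
```

Because magnitude and direction live in separate parameters, an optimizer can adjust one without disturbing the other, which is the stability argument made above.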

Core Capabilities

  • Enhanced zero-shot image classification
  • Improved text-image matching accuracy
  • Superior performance on longer text sequences
  • Better cosine similarities for image-text pairs
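The cosine-similarity scoring behind zero-shot classification can be sketched as follows (the embeddings here are hypothetical stand-ins; in practice they come from the model's image and text encoders):

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """CLIP-style scoring: normalize, take cosine similarities, scale, softmax."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

image_emb = np.array([0.9, 0.1, 0.0])    # stand-in image embedding
text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
                      [0.0, 1.0, 0.0]])  # e.g. "a photo of a dog"
probs = zero_shot_probs(image_emb, text_embs)
```

The caption whose embedding points closest to the image embedding receives the highest probability; longer, more specific captions simply produce text embeddings from a richer input.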

Frequently Asked Questions

Q: What makes this model unique?

The model's unique feature is its combination of extended token length support (248 tokens) with geometric parametrization, resulting in significantly improved accuracy while maintaining stability in fine-tuning scenarios.

Q: What are the recommended use cases?

The model is particularly well-suited for image-text matching tasks that require longer text descriptions, for zero-shot image classification, and as a text encoder for text-to-image diffusion models such as Stable Diffusion, SDXL, and Flux.1.
