siglip-large-patch16-384

google

A 652M-parameter vision-language model trained with a pairwise sigmoid loss instead of CLIP's softmax-based contrastive loss. It targets zero-shot image classification and image-text retrieval at 384x384 input resolution.

Property	Value
Parameter Count	652M
License	Apache 2.0
Training Data	WebLI Dataset
Resolution	384x384
Paper	Sigmoid Loss for Language Image Pre-Training

What is siglip-large-patch16-384?

SigLIP is a vision-language model that builds on CLIP's architecture but replaces the softmax-based contrastive loss with a pairwise sigmoid loss. This large variant takes 384x384 images, split into 16x16 patches as the name indicates, and is aimed primarily at zero-shot image classification.

Implementation Details

The model was trained on the WebLI dataset using 16 TPU-v4 chips over three days. Images are resized to 384x384 and each RGB channel is normalized with a mean of 0.5 and a standard deviation of 0.5. Text inputs are tokenized and padded to 64 tokens.

Sigmoid loss that does not require global normalization of pairwise similarities across the batch
Allows the batch size to be scaled up further while maintaining performance
Processes both image and text inputs for multimodal understanding
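
The preprocessing described in this section can be sketched in plain Python. This is a minimal illustration, not the model's actual processor: the rescaling of 8-bit values to [0, 1] before normalization, the helper names, and the `pad_id` value of 0 are assumptions made for the example. The real padding id comes from the model's tokenizer, and resizing to 384x384 is left out.

```python
from typing import List

MEAN = 0.5          # per-channel mean stated in the model card
STD = 0.5           # per-channel standard deviation stated in the model card
MAX_TEXT_LEN = 64   # text length stated in the model card


def normalize_pixel(value: int) -> float:
    """Map one 8-bit channel value (0..255) into the range [-1, 1]."""
    scaled = value / 255.0        # assumption: rescale to [0, 1] first
    return (scaled - MEAN) / STD  # then apply mean 0.5, std 0.5


def pad_token_ids(ids: List[int], pad_id: int = 0,
                  max_len: int = MAX_TEXT_LEN) -> List[int]:
    """Truncate or right-pad a list of token ids to exactly max_len entries.

    pad_id=0 is a placeholder; use the tokenizer's own padding id in practice.
    """
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```

With a mean and standard deviation of 0.5, the darkest channel value maps to -1 and the brightest to 1, so inputs are centered around zero.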

Core Capabilities

Zero-shot image classification
Image-text retrieval
Joint image and text embeddings for multimodal understanding
Image inputs at 384x384 resolution
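
As a rough sketch of how zero-shot classification works with a sigmoid-trained model, the snippet below scores one image embedding against several label embeddings. Everything here is hypothetical: the function names, the toy 2-D embeddings, and the `logit_scale` and `logit_bias` values are made up for illustration, whereas a real model supplies learned values and much larger embeddings. The point it shows is that each label gets an independent probability, so the scores need not sum to 1.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def zero_shot_scores(image_emb: Sequence[float],
                     label_embs: Dict[str, Sequence[float]],
                     logit_scale: float,
                     logit_bias: float) -> Dict[str, float]:
    """Return an independent match probability for every candidate label."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, emb in label_embs.items():
        txt = l2_normalize(emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        scores[label] = sigmoid(logit_scale * cosine + logit_bias)
    return scores


# Toy embeddings and scale/bias values, invented for this example.
scores = zero_shot_scores(
    [0.9, 0.1],
    {"a photo of a cat": [1.0, 0.0], "a photo of a dog": [0.0, 1.0]},
    logit_scale=20.0,
    logit_bias=-10.0,
)
best = max(scores, key=scores.get)  # label with the highest probability
```

The same scoring applies to image-text retrieval: rank candidate texts for an image (or images for a text) by their score.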

Frequently Asked Questions

Q: What makes this model unique?

SigLIP's main difference from CLIP is its sigmoid loss, which scores each image-text pair independently as a match or non-match instead of normalizing similarities across the whole batch with a softmax. Because no global normalization is required, the batch size can be scaled up further, and the loss also performs better than CLIP's softmax loss at smaller batch sizes.
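
The pairwise loss can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration under stated assumptions, not the training code: the function names are hypothetical, and the default `temperature` and `bias` are illustrative starting values, whereas in training both are learned. Each image-text pair in the batch is treated as its own binary classification, with label +1 for matching pairs (the diagonal) and -1 for all others.

```python
import math
from typing import List, Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]


def sigmoid_pairwise_loss(image_embs: Sequence[Sequence[float]],
                          text_embs: Sequence[Sequence[float]],
                          temperature: float = 10.0,
                          bias: float = -10.0) -> float:
    """Average over images of the summed per-pair binary log-losses.

    Pair (i, j) is a positive when i == j and a negative otherwise.
    No softmax over the batch is computed at any point.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            logit = temperature * sum(a * b for a, b in zip(img, txt)) + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


# Toy batch: aligned pairs should give a lower loss than swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = sigmoid_pairwise_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = sigmoid_pairwise_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Because every term depends on a single pair only, the sum can be computed in chunks without gathering a full batch-wide normalization, which is what makes very large batches practical.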

Q: What are the recommended use cases?

The model is best suited to zero-shot image classification and image-text retrieval. It fits applications that benefit from the 384x384 input resolution, and its loss design is relevant where the batch size available for training or fine-tuning is a constraint.