CLIP-RSICD-V2
| Property | Value |
|---|---|
| Release Date | July 2021 |
| Architecture | ViT-B/32 Transformer image encoder + masked self-attention Transformer text encoder |
| Training Data | RSICD, UCM, and Sydney datasets |
| Performance | 88.3% accuracy at k=1 |
What is clip-rsicd-v2?
CLIP-RSICD-V2 is a fine-tuned version of OpenAI's CLIP model, optimized for remote sensing image analysis. It pairs a Vision Transformer (ViT-B/32) image encoder with a masked self-attention Transformer text encoder, enabling zero-shot classification and image-text retrieval over aerial and satellite imagery.
Implementation Details
The model was fine-tuned with a batch size of 1024 on a TPU v3-8, using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4. Training uses a contrastive objective that maximizes the similarity of matching image-text pairs while pushing apart mismatched pairs within each batch (a minimal sketch of this objective appears after the list below).
- Built on CLIP ViT-B/32 architecture
- Trained on multiple remote sensing datasets
- Optimized for zero-shot classification
- Supports both text-to-image and image-to-image retrieval
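The contrastive objective can be summarized as a symmetric cross-entropy over the in-batch image-text similarity matrix. The snippet below is an illustrative PyTorch sketch, not the project's actual training code (which ran on TPU with Adafactor as noted above); the `clip_contrastive_loss` helper name and the fixed temperature are assumptions, and CLIP itself learns its logit scale rather than fixing it.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of image-text pairs."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal; treat them as the "correct class".
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```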
Core Capabilities
- Zero-shot image classification for remote sensing images (see the example after this list)
- Text-to-image retrieval with high accuracy
- Image-to-image similarity matching
- Significant performance improvement over base CLIP (88.3% vs 57.2% at k=1)
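As one way to run zero-shot classification, the sketch below uses the Hugging Face `transformers` CLIP classes; the checkpoint identifier, image file, and candidate labels are illustrative assumptions rather than values taken from this document.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint ID is an assumption -- substitute the published model identifier.
model_id = "flax-community/clip-rsicd-v2"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("aerial_scene.jpg")  # any aerial or satellite image
labels = [
    "an aerial photograph of a residential area",
    "an aerial photograph of a forest",
    "an aerial photograph of an airport",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```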
Frequently Asked Questions
Q: What makes this model unique?
This model specializes in remote sensing imagery and significantly outperforms the original CLIP model in this domain, improving k=1 classification accuracy by roughly 31 percentage points (88.3% vs. 57.2%).
Q: What are the recommended use cases?
The model is particularly suited for defense and law enforcement applications, climate change monitoring, and any scenario requiring analysis of aerial or satellite imagery. It excels at helping humans search through large collections of remote sensing images.
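For the search workflow described in the previous answer, a common pattern is to embed the whole image collection once and then rank it against a free-text query. This is a minimal sketch assuming the Hugging Face `transformers` API; the checkpoint ID, file names, and query text are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint ID and image file list are assumptions for illustration.
model_id = "flax-community/clip-rsicd-v2"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image_paths = ["tile_001.png", "tile_002.png", "tile_003.png"]

# Index step: embed every image once and keep the normalized vectors.
with torch.no_grad():
    pixels = processor(images=[Image.open(p) for p in image_paths], return_tensors="pt")
    image_emb = model.get_image_features(**pixels)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Query step: embed the text description and rank images by cosine similarity.
query = "ships docked at a harbor"
with torch.no_grad():
    tokens = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (image_emb @ text_emb.t()).squeeze(-1)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {image_paths[idx]}")
```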