CLIP-RSICD-V2
| Property | Value |
|---|---|
| Release Date | July 2021 |
| Architecture | ViT-B/32 Transformer image encoder + masked self-attention Transformer text encoder |
| Training Data | RSICD, UCM, and Sydney datasets |
| Performance | 88.3% accuracy at k=1 |
What is clip-rsicd-v2?
CLIP-RSICD-V2 is a fine-tuned version of OpenAI's CLIP model, optimized for remote sensing image analysis. It pairs a Vision Transformer (ViT-B/32) image encoder with a masked self-attention Transformer text encoder, enabling zero-shot classification and image-text retrieval over aerial and satellite imagery.
Implementation Details
The model was fine-tuned with a batch size of 1024 on a TPU v3-8, using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4. Training uses a contrastive objective that maximizes the similarity of matching image-text pairs while pushing apart mismatched pairs within each batch (a minimal sketch of this objective appears after the list below).
- Built on CLIP ViT-B/32 architecture
- Trained on multiple remote sensing datasets
- Optimized for zero-shot classification
- Supports both text-to-image and image-to-image retrieval
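The contrastive objective can be summarized as a symmetric cross-entropy over the in-batch image-text similarity matrix. The snippet below is an illustrative PyTorch sketch, not the project's actual training code (which ran on TPU with Adafactor as noted above); the `clip_contrastive_loss` helper name and the fixed temperature are assumptions, and CLIP itself learns its logit scale rather than fixing it.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss over a batch of image-text pairs."""
    # Normalize embeddings so the dot product is a cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_features @ text_features.t() / temperature

    # Matching pairs sit on the diagonal; treat them as the "correct class".
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```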
Core Capabilities
- Zero-shot image classification for remote sensing images (see the example after this list)
- Text-to-image retrieval with high accuracy
- Image-to-image similarity matching
- Significant performance improvement over base CLIP (88.3% vs 57.2% at k=1)
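As one way to run zero-shot classification, the sketch below uses the Hugging Face `transformers` CLIP classes; the checkpoint identifier, image file, and candidate labels are illustrative assumptions rather than values taken from this document.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint ID is an assumption -- substitute the published model identifier.
model_id = "flax-community/clip-rsicd-v2"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("aerial_scene.jpg")  # any aerial or satellite image
labels = [
    "an aerial photograph of a residential area",
    "an aerial photograph of a forest",
    "an aerial photograph of an airport",
]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the image-text similarity scores gives label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{p:.3f}  {label}")
```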
Frequently Asked Questions
Q: What makes this model unique?
This model specializes in remote sensing imagery and significantly outperforms the original CLIP model in this domain, improving k=1 classification accuracy by roughly 31 percentage points (88.3% vs. 57.2%).
Q: What are the recommended use cases?
The model is particularly suited for defense and law enforcement applications, climate change monitoring, and any scenario requiring analysis of aerial or satellite imagery. It excels at helping humans search through large collections of remote sensing images.
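For the search workflow described in the previous answer, a common pattern is to embed the whole image collection once and then rank it against a free-text query. This is a minimal sketch assuming the Hugging Face `transformers` API; the checkpoint ID, file names, and query text are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint ID and image file list are assumptions for illustration.
model_id = "flax-community/clip-rsicd-v2"
model = CLIPModel.from_pretrained(model_id).eval()
processor = CLIPProcessor.from_pretrained(model_id)

image_paths = ["tile_001.png", "tile_002.png", "tile_003.png"]

# Index step: embed every image once and keep the normalized vectors.
with torch.no_grad():
    pixels = processor(images=[Image.open(p) for p in image_paths], return_tensors="pt")
    image_emb = model.get_image_features(**pixels)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

# Query step: embed the text description and rank images by cosine similarity.
query = "ships docked at a harbor"
with torch.no_grad():
    tokens = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

scores = (image_emb @ text_emb.t()).squeeze(-1)
for idx in scores.argsort(descending=True).tolist():
    print(f"{scores[idx].item():.3f}  {image_paths[idx]}")
```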