clip-rsicd-v2

Maintained by: flax-community

  • Release Date: July 2021
  • Architecture: ViT-B/32 Transformer + Masked Self-Attention Transformer
  • Training Data: RSICD, UCM, and Sydney datasets
  • Performance: 88.3% accuracy at k=1

What is clip-rsicd-v2?

CLIP-RSICD-V2 is a fine-tuned version of OpenAI's CLIP model, specialized for remote sensing image analysis. It pairs a Vision Transformer (ViT) image encoder with a text encoder, enabling zero-shot classification and image-text retrieval for aerial and satellite imagery.
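The shared embedding space behind zero-shot classification can be illustrated with a small numerical sketch. The vectors below are made-up stand-ins for the real ViT-B/32 image features and text-encoder features, not actual model outputs:

```python
import numpy as np

def cosine_scores(image_emb, text_embs):
    """Cosine similarity between one image embedding and each text embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy 4-d embeddings (hypothetical values, far smaller than the real 512-d space).
image = np.array([0.9, 0.1, 0.0, 0.2])
captions = np.array([
    [0.8, 0.2, 0.1, 0.1],   # "an aerial photo of an airport"
    [0.0, 0.9, 0.3, 0.0],   # "a satellite image of farmland"
    [0.1, 0.0, 0.9, 0.4],   # "a photo of a harbor"
])

scores = cosine_scores(image, captions)
# CLIP scales similarities by a learned temperature before the softmax;
# 100 is the commonly used inference-time value.
probs = softmax(100.0 * scores)
print(probs.argmax())  # index of the best-matching caption -> 0
```

At inference time, the caption set is simply the list of candidate class prompts, so the argmax is the zero-shot class prediction.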

Implementation Details

The model was trained using a batch size of 1024 on a TPU-v3-8, utilizing the Adafactor optimizer with linear warmup and decay, reaching a peak learning rate of 1e-4. It implements a contrastive learning approach to maximize the similarity between image-text pairs.

  • Built on CLIP ViT-B/32 architecture
  • Trained on multiple remote sensing datasets
  • Optimized for zero-shot classification
  • Supports both text-to-image and image-to-image retrieval
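The contrastive objective mentioned above can be sketched as the symmetric cross-entropy loss CLIP uses over a batch of image-text pairs. This is a minimal NumPy version with toy embeddings; the actual training run used batch size 1024 and a learned temperature:

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities, as in CLIP.

    Matching image/text pairs sit on the diagonal of the logit matrix;
    the loss pushes diagonal similarities up and off-diagonal ones down.
    """
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (image_embs @ text_embs.T) / temperature  # (batch, batch)

    def cross_entropy(l):
        # Row-wise log-softmax, picking out the diagonal (correct-pair) entries.
        m = l.max(axis=1, keepdims=True)
        log_probs = l - m - np.log(np.exp(l - m).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average of the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))
# Correctly aligned pairs should score a lower loss than mismatched pairs.
aligned = clip_contrastive_loss(batch, batch)
shuffled = clip_contrastive_loss(batch, batch[::-1])
print(aligned < shuffled)  # True
```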

Core Capabilities

  • Zero-shot image classification for remote sensing images
  • Text-to-image retrieval with high accuracy
  • Image-to-image similarity matching
  • Significant performance improvement over base CLIP (88.3% vs 57.2% at k=1)
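The k=1 figure above is a top-k retrieval accuracy: for each text query, all candidate images are ranked by similarity and the query counts as correct if its matching image appears among the top k. A minimal sketch of the metric, using a hypothetical similarity matrix rather than real model scores:

```python
import numpy as np

def top_k_accuracy(sim_matrix, k=1):
    """Fraction of queries whose correct item (index i for query i)
    appears among the k highest-scoring candidates."""
    # Argsort each row in descending order and keep the first k indices.
    top_k = np.argsort(-sim_matrix, axis=1)[:, :k]
    hits = [i in top_k[i] for i in range(sim_matrix.shape[0])]
    return float(np.mean(hits))

# Hypothetical 3x3 similarity matrix: rows are text queries, columns are
# images; the correct image for query i is image i.
sims = np.array([
    [0.9, 0.2, 0.1],   # query 0 ranks image 0 first -> hit at k=1
    [0.3, 0.8, 0.4],   # query 1 ranks image 1 first -> hit at k=1
    [0.7, 0.1, 0.5],   # query 2 ranks image 0 first -> miss at k=1, hit at k=2
])
print(top_k_accuracy(sims, k=1))  # 0.666...
print(top_k_accuracy(sims, k=2))  # 1.0
```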

Frequently Asked Questions

Q: What makes this model unique?

This model specializes in remote sensing imagery and significantly outperforms the original CLIP model in that domain, improving k=1 classification accuracy by roughly 31 percentage points (88.3% vs. 57.2%).

Q: What are the recommended use cases?

The model is particularly suited for defense and law enforcement applications, climate change monitoring, and any scenario requiring analysis of aerial or satellite imagery. It excels at helping humans search through large collections of remote sensing images.
