# StreetCLIP
| Property | Value |
|---|---|
| Base Model | CLIP ViT-Large-Patch14-336 |
| License | CC-BY-NC-4.0 |
| Paper | arXiv:2302.00275 |
| Training Data | 1.1M street-level images from 101 countries |
## What is StreetCLIP?

StreetCLIP is a foundation model for open-domain image geolocalization and geographic analysis. Built on OpenAI's CLIP architecture, it was trained on 1.1 million geo-tagged street-level images spanning urban and rural scenes, and achieves state-of-the-art zero-shot performance on geographic classification tasks.
## Implementation Details

The model uses a ViT backbone with 14x14-pixel patches and 336-pixel input images. It is pretrained with a synthetic caption method that adapts CLIP's zero-shot capabilities to geographic contexts. Training ran for 3 epochs on 3 NVIDIA A100 GPUs using the AdamW optimizer with a learning rate of 1e-6.
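The optimizer setup described above can be sketched as follows. This is an illustrative skeleton, not the authors' training script: the `torch.nn.Linear` stands in for the CLIP ViT-L/14-336 backbone, and the loss computation is elided.

```python
import torch

# Placeholder module standing in for the CLIP ViT-L/14-336 backbone;
# the reported run fine-tuned the full model on 3 NVIDIA A100 GPUs.
model = torch.nn.Linear(768, 768)

# AdamW with the learning rate reported above (1e-6).
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

for epoch in range(3):  # 3 epochs, matching the reported setup
    # ... forward pass, contrastive loss, loss.backward() would go here ...
    optimizer.step()
    optimizer.zero_grad()
```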
- Zero-shot classification architecture
- Domain-specific caption template training
- Hierarchical linear probing for evaluation
- Outperforms supervised models trained on millions of images
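The zero-shot classification listed above follows the standard CLIP recipe: embed the image and a set of candidate captions, compare them by scaled cosine similarity, and softmax over the candidates. A minimal sketch of those mechanics, using random stand-in embeddings rather than real StreetCLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as CLIP does before scoring."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# One "image embedding" and three "caption embeddings" (e.g. one per country).
# Random stand-ins for illustration only.
image_emb = l2_normalize(rng.standard_normal(768))
text_embs = l2_normalize(rng.standard_normal((3, 768)))

# Cosine similarities scaled by CLIP's learned logit scale (~100 after
# training), then a numerically stable softmax over candidate captions.
logits = 100.0 * text_embs @ image_emb
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.argmax())  # index of the best-matching caption
```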
## Core Capabilities
- Geographic location prediction from street-level imagery
- Urban and rural scene understanding
- Building type and quality analysis
- Infrastructure assessment
- Environmental monitoring and vegetation mapping
## Frequently Asked Questions

### Q: What makes this model unique?
StreetCLIP's distinguishing feature is that it performs zero-shot geographic classification without explicit training on target locations. It achieves this through its synthetic caption pretraining method combined with training on geographically diverse street-level imagery.
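A minimal sketch of this zero-shot workflow with the Hugging Face `transformers` API. The checkpoint id `geolocal/StreetCLIP`, the caption template, and the helper names are assumptions for illustration, not taken from this card:

```python
def build_captions(countries):
    """Wrap candidate labels in a StreetCLIP-style caption template
    (the exact template here is an illustrative assumption)."""
    return [f"A street-level photo taken in {c}." for c in countries]

def predict_country(image_path, countries):
    """Score an image against candidate countries and return the best match."""
    # Heavy imports are kept local so build_captions stays dependency-free.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("geolocal/StreetCLIP")  # assumed hub id
    processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

    image = Image.open(image_path)
    inputs = processor(text=build_captions(countries), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-text similarity
    probs = logits.softmax(dim=1).squeeze(0)
    return countries[int(probs.argmax())]
```

Usage would look like `predict_country("street.jpg", ["France", "Japan", "Brazil"])`, which returns the caption label with the highest similarity to the image.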
### Q: What are the recommended use cases?
The model excels in various applications including urban planning, infrastructure assessment, environmental monitoring, and general geographic analysis. It's particularly effective for analyzing building quality, road conditions, vegetation mapping, and natural disaster impact assessment.