dpt-large

Maintained By
Intel

DPT-Large (MiDaS 3.0)

Property: Value
Parameter Count: 342M
License: Apache 2.0
Paper: Vision Transformers for Dense Prediction
Author: Intel
Training Data: MIX 6 (1.4M images)

What is dpt-large?

DPT-Large is a Dense Prediction Transformer (DPT) model for monocular depth estimation, that is, predicting per-pixel depth from a single image. Developed by Intel, it uses the Vision Transformer (ViT) architecture as its backbone and adds neck and head components on top for the depth estimation task.

Implementation Details

The model is built on the transformer architecture and was trained on MIX 6, a dataset of 1.4 million images. It shows strong zero-shot transfer: on the DIW benchmark it reaches a WHDR of 10.82 (lower is better), a 13.2% improvement over previous approaches. MIX 6 is the training set, not an evaluation benchmark.

  • Based on Vision Transformer (ViT) architecture
  • Trained on MIX 6 dataset with ImageNet pre-training
  • Processes images with longer side resized to 384 pixels
  • Supports PyTorch framework with F32 tensor type
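
The resizing rule in the list above can be sketched as a small helper. This is a minimal illustration, not the model's actual preprocessing code: the function name `target_size` is hypothetical, and snapping each side to a multiple of 32 is an assumption added here so the result divides evenly into patches. The Hugging Face image processor performs its own resizing, so you would not normally call something like this yourself.

```python
def target_size(width, height, longer_side=384, multiple=32):
    """Return (new_width, new_height) with the longer side scaled to `longer_side`.

    The aspect ratio is preserved approximately; each side is then snapped to
    the nearest multiple of `multiple` (an assumption, not stated on this page).
    """
    scale = longer_side / max(width, height)

    def snap(value):
        # Round the scaled side to the nearest multiple, never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(width), snap(height)


print(target_size(640, 480))  # (384, 288)
print(target_size(480, 640))  # (288, 384)
```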

Core Capabilities

  • Zero-shot monocular depth estimation
  • Improved zero-shot results over previous approaches on multiple benchmarks (ETH3D, Sintel, KITTI, NYU, TUM)
  • Efficient processing through transformer architecture
  • Easy integration through Hugging Face Transformers pipeline
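
As a rough sketch of the pipeline integration mentioned above, the following wraps the Transformers `depth-estimation` pipeline in a helper. The helper name `estimate_depth` and the file name `room.jpg` are hypothetical; the sketch assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint is published as `Intel/dpt-large` on the Hugging Face Hub. The first call downloads the model weights.

```python
MODEL_ID = "Intel/dpt-large"  # assumed Hub identifier for this checkpoint


def estimate_depth(image, model_id=MODEL_ID):
    """Run monocular depth estimation on one image.

    `image` may be a PIL image, a local file path, or an image URL.
    Returns (predicted_depth, depth_image): the raw depth tensor and a
    PIL image rendering of it, as produced by the pipeline.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    estimator = pipeline(task="depth-estimation", model=model_id)
    result = estimator(image)
    return result["predicted_depth"], result["depth"]


if __name__ == "__main__":
    # "room.jpg" is a placeholder; point this at any image you have locally.
    predicted_depth, depth_image = estimate_depth("room.jpg")
    depth_image.save("room_depth.png")
```

The output is relative depth rather than metric distance, so the values are only meaningful in comparison with each other within one image.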

Frequently Asked Questions

Q: What makes this model unique?

DPT-Large stands out for its transformer-based architecture and its zero-shot transfer results, which improve on previous approaches across multiple benchmarks. The largest reported gains are a 31.2% improvement in AbsRel (absolute relative error) on ETH3D and a 64.6% improvement in the δ>1.25 outlier rate on KITTI. Both metrics are errors, so lower values are better.
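
To make the benchmark numbers easier to read, here is a minimal sketch of two error metrics commonly used for depth estimation: AbsRel (mean absolute relative error) and the δ>1.25 outlier percentage. These are the standard textbook definitions, not code from the model's evaluation suite; the function names are hypothetical, inputs are assumed to be flat lists of positive depths, and the scale-and-shift alignment that zero-shot evaluations typically apply beforehand is omitted.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all pixels (lower is better)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_outlier_pct(pred, gt, threshold=1.25):
    """Percentage of pixels where max(pred/gt, gt/pred) exceeds `threshold`."""
    outliers = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) > threshold)
    return 100.0 * outliers / len(gt)


# Toy example with three pixels (values are made up for illustration).
pred = [1.0, 2.0, 4.0]
gt = [1.0, 2.5, 2.0]
print(abs_rel(pred, gt))            # about 0.4
print(delta_outlier_pct(pred, gt))  # about 33.3 (one pixel in three is an outlier)
```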

Q: What are the recommended use cases?

The model is primarily designed for zero-shot monocular depth estimation tasks. While it can be used out-of-the-box for depth estimation, it's recommended to fine-tune the model for specific use cases. It's particularly suitable for applications requiring accurate depth estimation from single images, such as 3D scene understanding, robotics, and augmented reality.
