DUSt3R_ViTLarge_BaseDecoder_512_dpt

Property	Value
Parameter Count	571M
Model Type	Image-to-3D
Architecture	ViT-Large encoder with ViT-Base decoder
License	CC BY-NC-SA 4.0
Paper	arXiv:2312.14132

What is DUSt3R_ViTLarge_BaseDecoder_512_dpt?

DUSt3R is a geometric 3D vision model developed by Naver Labs Europe. This variant pairs a ViT-Large encoder with a ViT-Base decoder and a DPT (Dense Prediction Transformer) head, and it targets inputs whose longer side is 512px.

Implementation Details

The model was trained at multiple resolutions (512x384, 512x336, 512x288, 512x256, 512x160). Its asymmetric architecture is implemented by the AsymmetricCroCo3DStereo class. It is built on PyTorch, and its weights are stored as F32 tensors.
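Because the training resolutions listed above differ only in aspect ratio, one practical question is which of them is closest to a given input image. The following is a minimal sketch of that lookup, not the library's own preprocessing. The function name `nearest_training_resolution` is hypothetical, and the handling of portrait images (matching on the long/short ratio and returning the transposed size) is an assumption of this example.

```python
# Training resolutions (width, height) as listed on the model card.
TRAINING_RESOLUTIONS = [(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)]


def nearest_training_resolution(width: int, height: int) -> tuple[int, int]:
    """Return the listed resolution whose aspect ratio is closest to the input.

    Hypothetical helper: portrait inputs are matched on their long/short ratio
    and the result is returned transposed (an assumption of this sketch).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    portrait = height > width
    target = max(width, height) / min(width, height)
    best_w, best_h = min(
        TRAINING_RESOLUTIONS, key=lambda wh: abs(wh[0] / wh[1] - target)
    )
    return (best_h, best_w) if portrait else (best_w, best_h)


print(nearest_training_resolution(1920, 1080))  # (512, 288)
print(nearest_training_resolution(4032, 3024))  # (512, 384)
```

A 16:9 frame maps to 512x288 and a 4:3 photo maps to 512x384, since those listed sizes share the same aspect ratios.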

DPT head for dense predictions
Asymmetric architecture combining a ViT-Large encoder with a ViT-Base decoder
Training at multiple resolutions
PyTorch-based implementation with weights in safetensors format
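As a rough guide to what the parameter count and F32 storage imply, the weight footprint can be estimated with simple arithmetic. This sketch assumes the rounded figure of 571M parameters from the table and 4 bytes per F32 value; it covers the weights only and says nothing about activations or runtime overhead.

```python
PARAMETER_COUNT = 571_000_000  # "571M" from the model card (rounded)
BYTES_PER_F32 = 4              # a 32-bit float occupies 4 bytes


def weight_memory_bytes(num_params: int, bytes_per_param: int = BYTES_PER_F32) -> int:
    """Estimate the storage needed for the weights alone."""
    return num_params * bytes_per_param


size = weight_memory_bytes(PARAMETER_COUNT)
print(f"{size / 1e9:.2f} GB")     # 2.28 GB
print(f"{size / 2**30:.2f} GiB")  # 2.13 GiB
```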

Core Capabilities

High-quality 3D geometric vision processing
Handling of inputs at several aspect ratios with a 512px longer side
Dense per-pixel predictions through the DPT head
A lighter ViT-Base decoder paired with the larger ViT-Large encoder to balance accuracy and compute

Frequently Asked Questions

Q: What makes this model unique?

Its asymmetric architecture pairs a ViT-Large encoder with a smaller ViT-Base decoder and targets 512px inputs. This makes it well suited to geometric 3D vision tasks while keeping computational cost in check.

Q: What are the recommended use cases?

The model suits applications that need geometric 3D vision, including 3D reconstruction, depth estimation, and stereo vision. It is intended for inputs whose longer side is up to 512px.
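To illustrate the depth estimation use case: DUSt3R predicts a dense pointmap, meaning one 3D point per pixel. If that pointmap is expressed in the camera frame of the image with z along the optical axis (an assumption of this sketch), a depth map is simply the z coordinate of each point. The example below uses plain Python lists and a toy 2x2 pointmap rather than real model output; `depth_from_pointmap` is a hypothetical helper, not part of the DUSt3R codebase.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
Pointmap = List[List[Point]]  # H x W grid of (x, y, z) points


def depth_from_pointmap(pointmap: Pointmap) -> List[List[float]]:
    """Return the z coordinate of every point as a per-pixel depth map.

    Assumes the points are in the camera frame with z along the optical axis.
    """
    return [[point[2] for point in row] for row in pointmap]


# Toy 2x2 pointmap standing in for real model output.
toy = [
    [(0.0, 0.0, 2.0), (0.1, 0.0, 2.5)],
    [(0.0, 0.1, 3.0), (0.1, 0.1, 3.5)],
]
print(depth_from_pointmap(toy))  # [[2.0, 2.5], [3.0, 3.5]]
```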