DUSt3R_ViTLarge_BaseDecoder_512_dpt
| Property | Value |
|---|---|
| Parameter Count | 571M |
| Model Type | Image-to-3D |
| Architecture | ViT-Large encoder with ViT-Base decoder |
| License | CC BY-NC-SA 4.0 |
| Paper | arXiv:2312.14132 |
What is DUSt3R_ViTLarge_BaseDecoder_512_dpt?
DUSt3R is a geometric 3D vision model from NAVER LABS Europe, introduced in the paper "DUSt3R: Geometric 3D Vision Made Easy" (arXiv:2312.14132). This variant pairs a ViT-Large encoder with a ViT-Base decoder and a DPT (Dense Prediction Transformer) head, and is intended for inputs whose longer side is 512px.
Implementation Details
The model was trained at multiple resolutions (512x384, 512x336, 512x288, 512x256, 512x160) and uses an asymmetric encoder-decoder design, provided by the AsymmetricCroCo3DStereo class. It is implemented in PyTorch, and its weights are stored as F32 tensors.
- DPT head for dense, per-pixel predictions
- Asymmetric architecture: ViT-Large encoder, ViT-Base decoder
- Trained at five resolutions, all 512px wide
- PyTorch implementation with weights available in safetensors format
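Because every training resolution is 512px wide but differs in height, it can help to know which one an input image most resembles. The following is a minimal sketch, not part of the DUSt3R codebase: `closest_training_resolution` is a hypothetical helper that picks the listed resolution with the nearest aspect ratio, and treating portrait images as transposed landscape shapes is an assumption made here for illustration.

```python
# Training resolutions listed above, as (width, height) in pixels.
TRAINING_RESOLUTIONS = [(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)]


def closest_training_resolution(width, height):
    """Return the training resolution whose aspect ratio best matches the input.

    Hypothetical helper for illustration only.
    Assumption: portrait inputs are matched against transposed shapes.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    portrait = height > width
    # Compare long-side / short-side ratios so orientation does not matter.
    aspect = max(width, height) / min(width, height)
    best_w, best_h = min(
        TRAINING_RESOLUTIONS, key=lambda wh: abs(wh[0] / wh[1] - aspect)
    )
    # Swap back to (width, height) order for portrait inputs.
    return (best_h, best_w) if portrait else (best_w, best_h)


if __name__ == "__main__":
    for size in [(1920, 1080), (640, 480), (1080, 1920)]:
        print(size, "->", closest_training_resolution(*size))
```

A 16:9 frame maps to 512x288 and a 4:3 frame to 512x384; how the official preprocessing resizes or crops images may differ from this sketch.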
Core Capabilities
- Dense 3D geometry prediction from images
- Handles several input aspect ratios, from 4:3 (512x384) to 16:5 (512x160)
- Dense, per-pixel predictions through the DPT head
- A smaller ViT-Base decoder, which reduces compute relative to a symmetric ViT-Large design
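To put rough numbers on the cost side, the sketch below estimates the token count per image at each training resolution and the size of the weights in memory. It is a back-of-envelope calculation, not a measurement: the patch size of 16 is an assumption (the usual ViT setting, not stated in this card), and the weight estimate simply multiplies the 571M parameter count by 4 bytes per F32 value, ignoring activations and framework overhead.

```python
PATCH_SIZE = 16             # assumption: standard ViT patch size, not stated in this card
PARAM_COUNT = 571_000_000   # "571M" from the property table
BYTES_PER_F32 = 4           # a 32-bit float occupies 4 bytes

TRAINING_RESOLUTIONS = [(512, 384), (512, 336), (512, 288), (512, 256), (512, 160)]


def tokens_per_image(width, height, patch_size=PATCH_SIZE):
    """Number of non-overlapping patch tokens for one image."""
    if width % patch_size or height % patch_size:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (width // patch_size) * (height // patch_size)


def weight_bytes(param_count=PARAM_COUNT, bytes_per_param=BYTES_PER_F32):
    """Approximate size of the stored weights in bytes."""
    return param_count * bytes_per_param


if __name__ == "__main__":
    for w, h in TRAINING_RESOLUTIONS:
        print(f"{w}x{h}: {tokens_per_image(w, h)} tokens")
    print(f"weights: about {weight_bytes() / 1e9:.2f} GB in F32")
```

Under these assumptions, the widest-aspect resolution (512x160) uses fewer than half the tokens of 512x384, which is the main lever on per-image compute.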
Frequently Asked Questions
Q: What makes this model unique?
Its asymmetric design pairs a ViT-Large encoder with a lighter ViT-Base decoder and targets 512px inputs. The aim is to keep accuracy on geometric 3D vision tasks while limiting the compute spent in the decoder.
Q: What are the recommended use cases?
The model fits applications that need geometric 3D vision, including 3D reconstruction, depth estimation, and stereo vision. It is best suited to inputs whose longer side is up to 512px.
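For the depth-estimation use case, the DUSt3R paper describes the network output as a per-pixel map of 3D points. Assuming such a point map is expressed in the camera's own coordinate frame with z pointing forward, a depth map is just the z coordinate of each point. The sketch below uses a tiny hand-made point map in plain Python lists; `depth_from_pointmap` is a hypothetical helper, and no real model output is involved.

```python
def depth_from_pointmap(pointmap):
    """Extract per-pixel depth from an H x W grid of (x, y, z) points.

    Assumption: points are in the camera frame with z along the viewing axis.
    """
    return [[point[2] for point in row] for row in pointmap]


# Toy 2x2 point map (made-up values, for illustration only).
toy_pointmap = [
    [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.5)],
    [(-0.5, 0.5, 3.0), (0.5, 0.5, 3.5)],
]

if __name__ == "__main__":
    for row in depth_from_pointmap(toy_pointmap):
        print(row)
```

With real predictions the same idea applies to a tensor by selecting the last channel, provided the points are in the frame of the camera whose depth is wanted.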