# DINOv2 with Registers Large
| Property | Value |
|---|---|
| Developer | Meta AI |
| Model Type | Vision Transformer (ViT) |
| Paper | Vision Transformers Need Registers |
| Primary Use | Self-supervised image feature extraction |
## What is dinov2-with-registers-large?
DINOv2 with Registers Large is a Vision Transformer that adds dedicated "register" tokens to the input sequence to address the artifacts that appear in the attention maps of standard ViTs. This large-sized model builds on the self-supervised DINOv2 architecture; the register tokens give the network a place to store global computations that would otherwise hijack low-information patch tokens, yielding cleaner, more interpretable attention maps and improved performance.
## Implementation Details
The model is a transformer encoder pre-trained on images without any labels. It distinguishes itself by adding learnable register tokens to the input sequence: these tokens participate in the forward pass as scratch space, but their outputs are discarded, so only the [CLS] and patch tokens serve as features. This effectively resolves the common issue of artifacts in the attention maps that plague standard ViTs.
- Built on a BERT-like transformer encoder architecture
- Incorporates register tokens that clean up the attention mechanism
- Produces clean, artifact-free attention maps
- Supports feature extraction via [CLS] token outputs (see the sketch after this list)
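A minimal feature-extraction sketch, assuming the `facebook/dinov2-with-registers-large` checkpoint on the Hugging Face Hub and the standard `transformers` AutoModel API:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "facebook/dinov2-with-registers-large"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token layout: [CLS], then register tokens, then patch tokens.
# The [CLS] token serves as an image-level feature vector.
cls_feature = outputs.last_hidden_state[:, 0]
print(cls_feature.shape)  # e.g. torch.Size([1, 1024]) for the large model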
## Core Capabilities
- Self-supervised image feature extraction
- High-quality visual representation learning
- Interpretable attention mapping
- Flexible integration with downstream tasks by adding a linear layer on top of the frozen features (see the linear-probe sketch after this list)
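A common integration pattern is a linear probe: freeze the backbone and train a single linear layer on the [CLS] feature. A hedged sketch, where `LinearProbe` and the class count are illustrative rather than part of any official API:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class LinearProbe(nn.Module):
    def __init__(self, backbone_name: str, num_classes: int):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.backbone.requires_grad_(False)  # freeze the pre-trained features
        self.head = nn.Linear(self.backbone.config.hidden_size, num_classes)

    def forward(self, pixel_values):
        with torch.no_grad():
            outputs = self.backbone(pixel_values=pixel_values)
        cls_token = outputs.last_hidden_state[:, 0]  # [CLS] feature
        return self.head(cls_token)

probe = LinearProbe("facebook/dinov2-with-registers-large", num_classes=10)
logits = probe(torch.randn(1, 3, 224, 224))  # dummy batch for shape checking
print(logits.shape)  # torch.Size([1, 10])
```

Only the small head is trained, which keeps the cost low and preserves the pre-trained representation.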
## Frequently Asked Questions
### Q: What makes this model unique?
The model's distinctive feature is its register tokens, which absorb the internal computations that otherwise show up as high-norm artifact tokens in the attention maps. This eliminates the artifacts while modestly improving overall performance, and the resulting attention maps are cleaner and easier to interpret, as the sketch below illustrates.
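To inspect the attention maps directly, one can request attention weights from the model. A sketch that continues from the feature-extraction example above, assuming the `transformers` API and a `num_register_tokens` config attribute (4 registers for this checkpoint):

```python
import torch
from transformers import AutoModel

# Eager attention is needed so attention weights can be returned.
model = AutoModel.from_pretrained(
    "facebook/dinov2-with-registers-large", attn_implementation="eager"
)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

attn = outputs.attentions[-1]                 # last layer: [batch, heads, seq, seq]
num_reg = model.config.num_register_tokens    # assumed attribute name; 4 here
# [CLS] attention over patch tokens only: skip CLS (index 0) and the registers.
cls_to_patches = attn[0, :, 0, 1 + num_reg:]  # [heads, num_patches]
side = int(cls_to_patches.shape[-1] ** 0.5)   # square patch grid
attention_maps = cls_to_patches.reshape(-1, side, side)
```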
### Q: What are the recommended use cases?
The model is best suited to feature extraction and can be adapted to a wide range of downstream computer vision tasks. It is particularly effective as a backbone for transfer learning, either with the features frozen (as in the linear probe above) or fine-tuned end to end for a specific task, as sketched below.
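A hedged fine-tuning sketch, assuming `AutoModelForImageClassification` supports this architecture and using an illustrative `num_labels` value:

```python
from transformers import AutoModelForImageClassification

# A fresh classification head is randomly initialized on top of the backbone;
# num_labels is task-specific and purely illustrative here.
model = AutoModelForImageClassification.from_pretrained(
    "facebook/dinov2-with-registers-large",
    num_labels=10,
)
# Every parameter is trainable by default; pass the model to your usual
# training loop or the transformers Trainer together with an image dataset.
```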