dinov2-with-registers-base

Vision Transformer model with innovative register tokens for improved attention maps and feature extraction, developed by Facebook

Property       Value
Author         Facebook
Paper          Vision Transformers Need Registers
Model Type     Vision Transformer (ViT)
Primary Use    Self-supervised image feature extraction

What is dinov2-with-registers-base?

DINOv2 with Registers is an innovative enhancement to the Vision Transformer (ViT) architecture, developed by Facebook. This model introduces special "register" tokens during pre-training to address attention map artifacts commonly found in traditional ViTs. The base variant represents a balanced approach between computational efficiency and performance.

Implementation Details

The model builds on a BERT-like transformer encoder architecture but introduces a crucial innovation: dedicated register tokens appended to the input sequence during pre-training. The outputs of these tokens are discarded, so features are taken only from the [CLS] and patch tokens. This effectively removes attention map artifacts while improving overall performance.

  • Implements register tokens for cleaner attention maps
  • Pre-trained using self-supervised learning
  • Features interpretable attention mechanisms
  • Designed for feature extraction without fine-tuned heads
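The register mechanism described above can be sketched in a few lines. This is an illustrative sketch, not the model's actual implementation: the dimensions (ViT-Base hidden size 768, 256 patch tokens from a 224×224 image at patch size 14, 4 registers) are assumptions, and the transformer encoder itself is elided.

```python
import numpy as np

# Illustrative dimensions (assumptions, not taken from the model card):
# ViT-Base hidden size 768; a 224x224 image at patch size 14 yields
# 16x16 = 256 patch tokens; the paper adds a small number of registers.
HIDDEN, N_PATCHES, N_REGISTERS = 768, 256, 4

rng = np.random.default_rng(0)
patch_tokens = rng.standard_normal((N_PATCHES, HIDDEN))  # patch embeddings
cls_token = rng.standard_normal((1, HIDDEN))             # learnable [CLS] token
registers = rng.standard_normal((N_REGISTERS, HIDDEN))   # learnable registers

# Input sequence fed to the encoder: [CLS] + registers + patches.
sequence = np.concatenate([cls_token, registers, patch_tokens], axis=0)
assert sequence.shape == (1 + N_REGISTERS + N_PATCHES, HIDDEN)

# ... the transformer encoder would process `sequence` here ...

# At the output, the register positions are simply dropped: downstream
# features come only from the [CLS] token and the patch tokens.
cls_out = sequence[0]
patch_out = sequence[1 + N_REGISTERS:]
```

The registers act as scratch space for global computation during attention, which is why their outputs can be thrown away without losing feature quality.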

Core Capabilities

  • High-quality image feature extraction
  • Clean, artifact-free attention maps
  • Flexible integration with downstream tasks
  • Effective representation learning for transfer learning

Frequently Asked Questions

Q: What makes this model unique?

The model's distinctive feature is its use of register tokens during pre-training, which effectively eliminates attention map artifacts while improving overall performance and interpretability. This innovation represents a significant advancement in Vision Transformer architecture.

Q: What are the recommended use cases?

The model is primarily designed for feature extraction tasks. It can be used as a backbone for various downstream computer vision tasks by adding a task-specific head (like a linear layer) on top of the pre-trained encoder. It's particularly effective for tasks requiring high-quality image representations.
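A task-specific head of the kind mentioned above can be as simple as a single linear layer over the pooled [CLS] features. The sketch below is a minimal illustration under assumed dimensions (hidden size 768, 10 classes, batch of 4); it uses random arrays in place of real encoder outputs and trained weights.

```python
import numpy as np

# Hypothetical linear classification head on frozen encoder features.
# Dimensions and class count are illustrative assumptions.
HIDDEN, N_CLASSES, BATCH = 768, 10, 4

rng = np.random.default_rng(0)
features = rng.standard_normal((BATCH, HIDDEN))      # pooled [CLS] features
W = rng.standard_normal((HIDDEN, N_CLASSES)) * 0.02  # head weights
b = np.zeros(N_CLASSES)                              # head bias

logits = features @ W + b                            # task-specific linear layer

# Softmax over classes, computed stably by subtracting the row max.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
```

In a linear-probe setup only `W` and `b` would be trained, keeping the pre-trained encoder frozen.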
