dit-base

microsoft

Document Image Transformer (DiT) base model - A BERT-like transformer encoder for document image processing, pre-trained on 42 million document images using self-supervised learning.

Author: Microsoft
Paper: DiT: Self-supervised Pre-training for Document Image Transformer
Downloads: 531,813
Framework: PyTorch

What is dit-base?

DiT-base is a transformer-based model designed specifically for document image processing. It follows the BEiT architecture and was pre-trained with self-supervised learning on the IIT-CDIP dataset, which contains 42 million document images. The model divides each image into 16x16 pixel patches and, during pre-training, learns to predict the visual tokens of patches that have been masked out.

Implementation Details

The model implements a BERT-like transformer encoder architecture with the following key technical aspects:

  • Processes images as sequences of 16x16 fixed-size patches
  • Linearly embeds each patch before it enters the encoder
  • Incorporates absolute position embeddings
  • Uses the encoder of a discrete VAE (dVAE) as an image tokenizer, which supplies the visual tokens the model is trained to predict
  • Utilizes masked patch prediction as pre-training objective
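
The patch and masking steps listed above can be illustrated with a small, dependency-free sketch. It is not the model's actual pre-training code. The 224x224 input resolution and the 40% mask ratio are assumptions chosen for illustration, and uniform random masking is a simplification of whatever masking strategy the real pre-training uses. The helper names `count_patches` and `random_patch_mask` are hypothetical.

```python
import random


def count_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> list:
    """Mark a random subset of patch positions as masked (True)."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Assumed 224x224 input: a 14 x 14 grid of 16x16 patches.
n = count_patches(224)
# Assumed mask ratio of 0.4; the model would be asked to predict
# the visual token at every position where the mask is True.
mask = random_patch_mask(n, mask_ratio=0.4)
print(n, sum(mask))
```

Under these assumptions the image yields 196 patches, of which 78 are masked. During pre-training, the encoder sees the unmasked patches and is trained to predict the dVAE visual token for each masked position.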

Core Capabilities

  • Document image encoding into vector space representations
  • Foundation for document image classification tasks
  • Table detection in documents
  • Document layout analysis
  • Support for fine-tuning on specific document processing tasks
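
As a rough sketch of the first capability, the snippet below encodes one document image into a single vector using the Hugging Face `transformers` library. Several things here are assumptions rather than facts from this page: the Hub ID `microsoft/dit-base`, the helper name `encode_document`, and the choice to mean-pool the patch states (other pooling choices are equally plausible).

```python
MODEL_ID = "microsoft/dit-base"  # assumed Hugging Face Hub ID


def encode_document(image_path: str, model_id: str = MODEL_ID):
    """Return one embedding vector per image by mean-pooling patch states."""
    # Imports live inside the function so this module loads
    # even where the heavy dependencies are not installed.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, 1 + num_patches, hidden_size);
    # position 0 is the [CLS] token, so skip it and average the patches.
    patch_states = outputs.last_hidden_state[:, 1:, :]
    return patch_states.mean(dim=1)
```

Calling `encode_document("scan.png")` (a hypothetical file path) would download the weights on first use and return a tensor with one row per image. Because the checkpoint is only pre-trained, these vectors are best treated as a starting point for fine-tuning rather than as task-ready features.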

Frequently Asked Questions

Q: What makes this model unique?

DiT-base stands out because it was pre-trained on 42 million document images rather than natural photographs. Its self-supervised objective needs no human labels, and it exposes the model to document-specific structure such as text blocks and tables, which makes it a better starting point for document tasks than general-purpose image models.

Q: What are the recommended use cases?

The model is best suited for document processing tasks such as classification, layout analysis, and table detection. It's designed to be fine-tuned rather than used as a standalone model, making it ideal for organizations with specific document processing needs.
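
A fine-tuning setup for classification might look like the sketch below, which attaches a freshly initialized classification head to the pre-trained encoder via `transformers`. The Hub ID `microsoft/dit-base`, the helper names `build_classifier` and `training_step`, and the number of labels are all assumptions for illustration. Data loading, evaluation, and hyperparameters are left out.

```python
def build_classifier(num_labels: int, model_id: str = "microsoft/dit-base"):
    """Load the pre-trained encoder with a new, untrained classification head."""
    # Imported lazily so the module loads without transformers installed.
    from transformers import BeitForImageClassification

    return BeitForImageClassification.from_pretrained(model_id, num_labels=num_labels)


def training_step(model, optimizer, pixel_values, labels) -> float:
    """Run one gradient update and return the batch loss."""
    model.train()
    # Passing labels makes the model compute a cross-entropy loss itself.
    outputs = model(pixel_values=pixel_values, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

In use, `pixel_values` would come from the model's image processor and `optimizer` from something like `torch.optim.AdamW(model.parameters(), lr=...)`, with the learning rate chosen for the task at hand.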
