dit-base

Document Image Transformer (DiT) base model: a BERT-like transformer encoder for document images, pre-trained with self-supervised learning on 42 million document images.

Property    Value
Author      Microsoft
Paper       DiT: Self-supervised Pre-training for Document Image Transformer
Downloads   531,813
Framework   PyTorch

What is dit-base?

DiT-base is a transformer model designed for document image processing. It uses the BEiT architecture and was pre-trained with self-supervised learning on IIT-CDIP, a dataset of 42 million document images. The model splits each image into 16x16 pixel patches, and during pre-training it learns to predict the visual tokens of patches that have been masked out.
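
To make the patch arithmetic concrete, here is a minimal sketch in plain Python. The 224x224 input size is an assumption chosen for illustration (the description above only fixes the 16x16 patch size), and `split_into_patches` is a hypothetical helper, not part of any DiT code.

```python
PATCH = 16  # patch side length in pixels, as stated above


def split_into_patches(image, patch=PATCH):
    """Split a 2D image (list of rows) into non-overlapping patch x patch tiles.

    Tiles are returned in row-major order, each flattened to a list of pixels.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


# Assumed 224x224 single-channel page; each pixel value encodes its position.
SIDE = 224
page = [[r * SIDE + c for c in range(SIDE)] for r in range(SIDE)]
patches = split_into_patches(page)

print(len(patches))     # (224 // 16) ** 2 = 196 patches
print(len(patches[0]))  # 16 * 16 = 256 pixels per single-channel patch
```

Under that assumption the transformer sees a sequence of 196 patches per image, in row-major order.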

Implementation Details

The model is a BERT-like transformer encoder with the following key properties:

  • Treats each image as a sequence of fixed-size 16x16 patches
  • Linearly embeds each patch
  • Adds absolute position embeddings to the patch embeddings
  • Takes its prediction targets (visual tokens) from the encoder of a discrete VAE (dVAE)
  • Uses masked patch prediction as the pre-training objective
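
The pre-training objective in the list above can be sketched as a toy loss computation. This is an illustration under stated assumptions, not the DiT training code: the 40% mask ratio, the vocabulary size of 8, the uniform random masking, and the stand-in `toy_tokenizer` and `toy_model` functions are all hypothetical. In the real setup the target ids would come from the dVAE encoder and the logits from the transformer, whereas the toy model here ignores the image entirely.

```python
import math
import random

VOCAB_SIZE = 8      # assumed toy vocabulary; a real dVAE codebook is far larger
NUM_PATCHES = 196   # assumed: 224x224 image cut into 16x16 patches
MASK_RATIO = 0.4    # assumed for illustration


def toy_tokenizer(num_patches, rng):
    """Stand-in for the dVAE encoder: one discrete token id per patch."""
    return [rng.randrange(VOCAB_SIZE) for _ in range(num_patches)]


def toy_model(num_patches, rng):
    """Stand-in for the transformer: one logit vector per patch."""
    return [[rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]
            for _ in range(num_patches)]


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target id."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def masked_patch_loss(logits, targets, masked_positions):
    """Average cross-entropy over the masked patches only."""
    losses = [cross_entropy(logits[i], targets[i]) for i in masked_positions]
    return sum(losses) / len(losses)


rng = random.Random(0)
targets = toy_tokenizer(NUM_PATCHES, rng)
masked = sorted(rng.sample(range(NUM_PATCHES), int(MASK_RATIO * NUM_PATCHES)))
logits = toy_model(NUM_PATCHES, rng)
loss = masked_patch_loss(logits, targets, masked)
print(f"{len(masked)} masked patches, loss = {loss:.3f}")
```

The point of the sketch is that only masked positions contribute to the loss; unmasked patches serve as context and are never scored.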

Core Capabilities

  • Document image encoding into vector space representations
  • Foundation for document image classification tasks
  • Table detection in documents
  • Document layout analysis
  • Support for fine-tuning on specific document processing tasks
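
As a rough sketch of the first capability, the snippet below encodes a page image into a single vector with the Hugging Face `transformers` library. It assumes the checkpoint is published under the Hub id `microsoft/dit-base` and that `transformers`, `torch`, and `Pillow` are installed. Mean-pooling the patch tokens and the `encode_document` helper are illustrative choices, not something the model card prescribes.

```python
MODEL_ID = "microsoft/dit-base"  # assumed Hugging Face Hub id


def encode_document(image_path, model_id=MODEL_ID):
    """Return one embedding vector (a 1D torch tensor) for a document image."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    # The processor resizes and normalizes the page to the model's input format.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state: (batch, 1 + num_patches, hidden); index 0 is [CLS].
    patch_tokens = outputs.last_hidden_state[:, 1:, :]
    # Assumed pooling choice: average the patch tokens into one vector.
    return patch_tokens.mean(dim=1).squeeze(0)
```

Calling `encode_document("page.png")` with a hypothetical local file would download the weights on first use. Because the checkpoint is pre-trained only, such vectors are a starting point for fine-tuning or for a small classifier trained on top, not a finished classifier.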

Frequently Asked Questions

Q: What makes this model unique?

DiT-base is pre-trained on 42 million document images rather than on natural photographs. Because its self-supervised objective is learned directly on documents, its representations are better matched to document-specific tasks than those of general-purpose image models.

Q: What are the recommended use cases?

The model is best suited as a starting point for document processing tasks such as classification, layout analysis, and table detection. It is designed to be fine-tuned on data for the target task rather than used as a standalone model, which suits organizations with specific document processing needs.
