Document Image Transformer (DiT) Large

Property	Value
Author	Microsoft
Research Paper	DiT: Self-supervised Pre-training for Document Image Transformer
Framework	PyTorch
Task	Document Image Classification

What is dit-large-finetuned-rvlcdip?

The Document Image Transformer (DiT) Large is a transformer-based model for document image analysis. It was pre-trained with self-supervision on IIT-CDIP, a collection of 42 million document images, and then fine-tuned on RVL-CDIP, a dataset of 400,000 grayscale images in 16 classes. The fine-tuned checkpoint classifies a document image into one of those 16 classes.

Implementation Details

DiT follows the BEiT architecture and processes each image as a sequence of fixed-size 16x16 patches. Each patch is linearly embedded, absolute position embeddings are added, and the resulting sequence is fed to a Transformer encoder. During self-supervised pre-training, some patches are masked and the model learns to predict the visual tokens that a discrete VAE (dVAE) encoder assigns to them.
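
The patch step can be illustrated with a small pure-Python sketch. It is not the model's own preprocessing code: the helper names patch_grid and patchify are hypothetical, and the 224x224 input resolution is an assumption that the text above does not state. Under that assumption, a 224x224 image yields a 14 x 14 grid, or 196 patches.

```python
def patch_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return (rows, cols) of the patch grid for an image of the given size."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def patchify(image: list[list[int]], patch: int = 16) -> list[list[int]]:
    """Split a 2-D image (a list of pixel rows) into flattened patches.

    Patches are returned in row-major order, which is the order in which
    they would form the input sequence.
    """
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = [
                image[r * patch + y][c * patch + x]
                for y in range(patch)
                for x in range(patch)
            ]
            patches.append(flat)
    return patches


if __name__ == "__main__":
    # 224x224 is an assumed input resolution, not taken from the text above.
    rows, cols = patch_grid(224, 224)
    print(rows * cols)  # number of 16x16 patches in the sequence
```

In the real model each flattened patch would then be projected by a learned linear layer; that step is omitted here because it depends on trained weights.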

Pre-trained on 42 million document images
Fine-tuned on 400,000 RVL-CDIP images
16-class classification capability
Patch-based image processing (16x16)

Core Capabilities

Document image classification across 16 categories
Feature extraction for downstream tasks
Document layout analysis (as a downstream task, after further fine-tuning)
Table detection (as a downstream task, after further fine-tuning)
Vector space encoding of document images
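
Feature extraction and vector-space encoding could look roughly like the sketch below. It assumes the checkpoint is published on the Hugging Face Hub as microsoft/dit-large-finetuned-rvlcdip and that the transformers, torch, and Pillow packages are installed; neither assumption comes from the text above. Mean pooling over all tokens is one possible pooling choice, not a documented recommendation, and the helper names are hypothetical.

```python
import math

MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"  # assumed Hub identifier


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one embedding."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def embed_document(path):
    """Encode one document image as a vector.

    Heavy dependencies are imported here so that the pure helpers above
    can be used without them.
    """
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModel

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    # AutoModel loads the encoder only, without the classification head.
    encoder = AutoModel.from_pretrained(MODEL_ID)
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]  # (tokens, hidden_size)
    return mean_pool(hidden.tolist())
```

Two documents can then be compared with cosine_similarity(embed_document(path_a), embed_document(path_b)), where path_a and path_b are placeholder file paths.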

Frequently Asked Questions

Q: What makes this model unique?

It combines self-supervised pre-training on 42 million unlabeled document images with supervised fine-tuning for document classification on RVL-CDIP. The architecture is the same as BEiT, so what distinguishes the model is its document-focused pre-training rather than a new network design.

Q: What are the recommended use cases?

Out of the box, the model classifies document images such as business documents and forms into its 16 predefined classes. Its encoder can also serve as a feature extractor, or as a starting point for layout analysis after further fine-tuning.
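
The classification use case can be sketched as follows. This is a minimal example under stated assumptions, not an official recipe: it assumes the checkpoint is available on the Hugging Face Hub as microsoft/dit-large-finetuned-rvlcdip and that transformers, torch, and Pillow are installed. The helper names softmax, top_k, and classify_document are hypothetical.

```python
import math

MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"  # assumed Hub identifier


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify_document(path, k=3):
    """Classify one document image into the model's 16 classes.

    Heavy dependencies are imported here so that the pure helpers above
    can be used without them.
    """
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
    # RVL-CDIP images are grayscale; convert to RGB for the image processor.
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k(logits, model.config.id2label, k)
```

Calling classify_document("scan.png"), where scan.png is a placeholder path, would return the three most likely class labels with their probabilities.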
