Document Image Transformer (DiT) Large
| Property | Value |
|---|---|
| Author | Microsoft |
| Research Paper | DiT: Self-supervised Pre-training for Document Image Transformer |
| Framework | PyTorch |
| Task | Document Image Classification |
What is dit-large-finetuned-rvlcdip?
DiT Large is a transformer-based model for document image analysis. It was pre-trained in a self-supervised fashion on IIT-CDIP, a collection of 42 million document images, and then fine-tuned on RVL-CDIP, a dataset of 400,000 grayscale images in 16 classes. The resulting checkpoint assigns a document image to one of those 16 classes.
Implementation Details
DiT follows the BEiT architecture: a transformer encoder that treats each image as a sequence of fixed-size 16x16 patches, with absolute position embeddings added to the patch embeddings. Pre-training is self-supervised: some patches are masked, and the model learns to predict the visual tokens that the encoder of a discrete VAE (dVAE) assigns to them.
- Pre-trained on 42 million document images
- Fine-tuned on 400,000 RVL-CDIP images
- 16-class classification capability
- Patch-based image processing (16x16)
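The patch-based input described above can be sketched in a few lines of NumPy. This is an illustration only, not the model's actual preprocessing: the 224x224 RGB input size, the 40% masking ratio, and the helper names `patchify` and `random_mask` are assumptions, and uniform random masking stands in for whatever masking strategy the real pre-training uses.

```python
import numpy as np

PATCH = 16  # patch side length stated in the text


def patchify(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be multiples of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch * patch * c)


def random_mask(num_patches: int, ratio: float, seed: int = 0) -> np.ndarray:
    """Boolean mask marking the patches whose visual tokens would be predicted."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(num_patches, dtype=bool)
    chosen = rng.choice(num_patches, size=int(num_patches * ratio), replace=False)
    mask[chosen] = True
    return mask


# Assumed input: one 224x224 RGB image (the resolution is not given in the text).
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)
mask = random_mask(len(patches), ratio=0.4)  # assumed masking ratio
print(patches.shape, int(mask.sum()))
```

Under these assumptions the image becomes a sequence of 196 patches (a 14x14 grid) with 768 values each, of which 78 are masked.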
Core Capabilities
- Document image classification across 16 categories
- Feature extraction for downstream tasks
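- Document layout analysis and table detection, once the DiT backbone is fine-tuned for those tasks (this checkpoint itself only predicts the 16 class labels)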
- Vector space encoding of document images
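As a rough sketch of the classification use case, the snippet below loads the checkpoint through the Hugging Face `transformers` library and ranks the predicted classes. The Hub id `microsoft/dit-large-finetuned-rvlcdip` is assumed from the model name and author given above, and `rank_labels` and `classify` are hypothetical helper names rather than part of any library.

```python
import math

# Assumed Hugging Face Hub id, inferred from the model name and author above.
MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"


def rank_labels(logits, id2label, k=3):
    """Softmax the logits and return the k most probable (label, probability) pairs."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id2label[i], probs[i]) for i in order]


def classify(image_path, k=3):
    """Classify one document image; downloads the model weights on first use."""
    # Imported lazily so rank_labels stays usable without these packages.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return rank_labels(logits, model.config.id2label, k)
```

Calling `classify("scan.png")`, where `scan.png` is a placeholder path, would return the top three class names with their probabilities. The label names come from the checkpoint's own `id2label` mapping.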
Frequently Asked Questions
Q: What makes this model unique?
What sets this model apart is the scale of its self-supervised pre-training, 42 million document images, followed by fine-tuning specifically for document classification. It reuses the BEiT architecture, applying it to document images rather than natural images.
Q: What are the recommended use cases?
The model is best suited to document classification and to feature extraction from document images. Layout analysis is a fine-tuning target for the pre-trained backbone rather than something this checkpoint does directly. It is intended for business documents, forms, and other structured documents that fall within the 16 RVL-CDIP classes.