dit-large-finetuned-rvlcdip

Maintained By
microsoft

Document Image Transformer (DiT) Large

Property        Value
Author          Microsoft
Research Paper  DiT: Self-supervised Pre-training for Document Image Transformer
Framework       PyTorch
Task            Document Image Classification

What is dit-large-finetuned-rvlcdip?

The Document Image Transformer (DiT) Large is a transformer-based model for document image analysis. It was pre-trained on IIT-CDIP, a dataset of 42 million document images, and then fine-tuned on RVL-CDIP, which contains 400,000 grayscale images across 16 classes.

Implementation Details

DiT follows the BEiT architecture: a transformer encoder that processes each image as a sequence of fixed-size 16x16 patches, with absolute position embeddings added to the patch sequence. Pre-training is self-supervised: some patches are masked, and the model predicts the visual tokens that the encoder of a discrete VAE (dVAE) assigns to those masked patches.
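The masked-patch objective can be illustrated with a toy sketch in plain Python. This is not the DiT training code: the helper name `masked_token_loss`, the three-token vocabulary, and the example numbers are all assumptions made for illustration. In the real setup the target tokens would come from the dVAE encoder and the scores from the transformer encoder.

```python
import math


def masked_token_loss(logits, target_tokens, mask):
    """Mean cross-entropy over masked patch positions only.

    logits        -- one list of vocabulary scores per patch (model output)
    target_tokens -- visual-token id per patch (assumed to come from a dVAE encoder)
    mask          -- True where the patch was masked in the input
    """
    losses = []
    for scores, target, is_masked in zip(logits, target_tokens, mask):
        if not is_masked:
            continue  # unmasked patches do not contribute to the loss
        # Numerically stable negative log-softmax of the target token.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        losses.append(log_norm - scores[target])
    if not losses:
        raise ValueError("at least one patch must be masked")
    return sum(losses) / len(losses)


# Toy example (hypothetical values): 4 patches, 3 visual tokens,
# with the patches at index 1 and 3 masked.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
toy_targets = [0, 1, 2, 0]
toy_mask = [False, True, False, True]
loss = masked_token_loss(toy_logits, toy_targets, toy_mask)
```

Only the masked positions enter the average, which is what pushes the encoder to infer the content of hidden patches from the visible ones.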

  • Pre-trained on 42 million document images
  • Fine-tuned on 400,000 RVL-CDIP images
  • 16-class classification capability
  • Patch-based image processing (16x16)
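
As a rough illustration of the 16x16 patching, the sketch below splits an image into non-overlapping patches and counts the resulting sequence length. The 224x224 input size is an assumption (it is not stated in this card), and `patch_grid` and `to_patches` are hypothetical helper names, not part of any library.

```python
def patch_grid(height, width, patch_size=16):
    """Return (rows, cols, sequence_length) for non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch_size, width // patch_size
    return rows, cols, rows * cols


def to_patches(image, patch_size=16):
    """Split a 2D image (list of pixel rows) into flattened patches, row-major."""
    rows, cols, _ = patch_grid(len(image), len(image[0]), patch_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [
                image[r * patch_size + dy][c * patch_size + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Assumed input size of 224x224 pixels (an assumption, see above).
rows, cols, seq_len = patch_grid(224, 224)
blank_page = [[255] * 224 for _ in range(224)]  # all-white grayscale page
patches = to_patches(blank_page)
```

Under that assumption the encoder would see a 14x14 grid, that is 196 patches of 256 pixels each per channel. BEiT-style encoders typically prepend a classification token as well, so the actual sequence would be one position longer.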

Core Capabilities

  • Document image classification across 16 categories
  • Feature extraction for downstream tasks
  • Document layout analysis (after further fine-tuning for that task)
  • Table detection (after further fine-tuning for that task)
  • Vector space encoding of document images
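
A minimal sketch of the classification use case with the Hugging Face `transformers` library follows. The model ID is assumed from the page title and maintainer name, and `top_k_labels` and `classify_document` are hypothetical helper names. The first call downloads the checkpoint, so the sketch only defines functions and does not run inference on import.

```python
import math

# Assumed Hub ID, built from the maintainer and model name on this page.
MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"


def top_k_labels(logits, id2label, k=3):
    """Turn raw class logits into the k most probable (label, probability) pairs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # stable softmax numerators
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: exps[i], reverse=True)
    return [(id2label[i], exps[i] / total) for i in ranked[:k]]


def classify_document(image_path, k=3):
    """Classify one document image; downloads the checkpoint on first use."""
    # Imported lazily so top_k_labels works without these packages installed.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
    model.eval()

    # Convert to 3 channels for the processor; RVL-CDIP scans are grayscale.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return top_k_labels(logits, model.config.id2label, k)
```

Calling `classify_document("scan.png")` (a hypothetical file path) would return up to three labels with their softmax probabilities. For feature extraction, passing `output_hidden_states=True` to the model call should expose the encoder states, which can then be pooled into a vector per document.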

Frequently Asked Questions

Q: What makes this model unique?

The model combines self-supervised pre-training on 42 million document images with supervised fine-tuning on RVL-CDIP for 16-class document classification. Its architecture is identical to BEiT; what differs is the training data, which consists of document images rather than natural images.

Q: What are the recommended use cases?

The model is best suited to classifying scanned documents, such as business documents and forms, into its 16 predefined classes. It can also serve as a feature extractor, or as a starting point for fine-tuning on related tasks such as layout analysis.
