dit-large-finetuned-rvlcdip

microsoft

A large Document Image Transformer (DiT) model from Microsoft, fine-tuned on the RVL-CDIP dataset to classify document images into 16 classes.

Property          Value
Author            Microsoft
Research Paper    DiT: Self-supervised Pre-training for Document Image Transformer
Framework         PyTorch
Task              Document Image Classification

What is dit-large-finetuned-rvlcdip?

The Document Image Transformer (DiT) Large is a transformer-based model for document image analysis. It was pre-trained with self-supervision on IIT-CDIP, a collection of 42 million document images, and then fine-tuned on RVL-CDIP, which contains 400,000 grayscale images across 16 classes.

Implementation Details

DiT follows the BEiT architecture: a transformer encoder that treats each image as a sequence of fixed-size 16x16 patches, with absolute position embeddings added to the patch embeddings. Pre-training is self-supervised. Some patches are masked, and the model learns to predict the visual tokens that a discrete VAE encoder assigns to those masked patches.

  • Pre-trained on 42 million document images
  • Fine-tuned on 400,000 RVL-CDIP images
  • 16-class classification capability
  • Patch-based image processing (16x16)
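The patch-based processing and masking described above can be sketched in plain Python. This is only an illustration under stated assumptions: the 224x224 input size and the 0.4 mask ratio are not given in the text, the function names are hypothetical, and uniform random masking stands in for whatever masking strategy the actual pre-training uses.

```python
import random

PATCH_SIZE = 16  # patch side length stated in the text


def patch_grid(height, width, patch=PATCH_SIZE):
    """Return (rows, cols) of non-overlapping patches covering the image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return height // patch, width // patch


def patchify(image, patch=PATCH_SIZE):
    """Split a 2-D image (a list of pixel rows) into a row-major list of patches."""
    rows, cols = patch_grid(len(image), len(image[0]), patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            block = [row[c * patch:(c + 1) * patch]
                     for row in image[r * patch:(r + 1) * patch]]
            patches.append(block)
    return patches


def random_mask(num_patches, ratio, seed=0):
    """Choose patch indices to hide (a simplification of pre-training masking)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_patches), int(num_patches * ratio)))


# Assumption: a 224x224 input; the source text does not state the resolution.
image = [[0] * 224 for _ in range(224)]
patches = patchify(image)

# Assumption: the 0.4 ratio is illustrative only.
masked = random_mask(len(patches), ratio=0.4)
print(len(patches), len(masked))
```

During pre-training, the model would see the unmasked patches and be asked to predict one visual token for each index in `masked`; the sketch stops short of that step because it needs the trained tokenizer.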

Core Capabilities

  • Document image classification across 16 categories
  • Feature extraction for downstream tasks
  • Starting point for document layout analysis, with further fine-tuning
  • Starting point for table detection, with further fine-tuning
  • Encoding document images into a vector space
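For the classification capability, a minimal inference sketch with the Hugging Face `transformers` library might look like the following. It assumes the checkpoint is published on the Hub as `microsoft/dit-large-finetuned-rvlcdip` and that `torch`, `transformers`, and `Pillow` are installed; `classify_document`, `softmax`, and `top_k` are hypothetical helper names introduced here for illustration.

```python
import math

# Assumption: Hub identifier built from the author and model name in this card.
MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


def classify_document(image_path, k=3):
    """Classify one document image and return its top-k predicted classes."""
    # Heavy imports are local so the helpers above work without them.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
    model.eval()

    # The image processor expects three channels, so grayscale scans are converted.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    # id2label maps each of the 16 class indices to its class name.
    return top_k(logits, model.config.id2label, k)
```

Calling `classify_document("scan.png")` (a hypothetical file path) would download the weights on first use and return the three most likely classes with their probabilities.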

Frequently Asked Questions

Q: What makes this model unique?

This model combines self-supervised pre-training on 42 million document images with supervised fine-tuning for 16-class document classification. Its architecture is identical to BEiT, so the difference lies in the training data, which consists of document images.

Q: What are the recommended use cases?

The model is best suited to classifying business documents, forms, and other structured documents into its 16 predefined classes. It can also serve as a feature extractor, and as a starting point for layout analysis when fine-tuned further.
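As a rough illustration of the feature-extraction use case, the sketch below mean-pools the final hidden states into one vector per document and compares two documents by cosine similarity. The pooling choice, the Hub identifier, and the helper names `embed_document` and `cosine_similarity` are assumptions made for this example, not something the model card prescribes.

```python
import math

# Assumption: Hub identifier built from the author and model name in this card.
MODEL_ID = "microsoft/dit-large-finetuned-rvlcdip"


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embed_document(image_path):
    """Return one embedding vector (a list of floats) for a document image."""
    # Heavy imports are local so cosine_similarity works without them.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageClassification.from_pretrained(MODEL_ID)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # Last layer output has shape (1, 1 + num_patches, hidden_size).
    last = outputs.hidden_states[-1]
    # Assumption: skip the leading [CLS] token and average the patch tokens.
    return last[0, 1:, :].mean(dim=0).tolist()
```

With two hypothetical files, `cosine_similarity(embed_document("a.png"), embed_document("b.png"))` gives a score in [-1, 1]; loading the model once and reusing it would be preferable when embedding many documents.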
