donut-base-finetuned-rvlcdip

Maintained By
naver-clova-ix

Property      Value
Author        Naver Clova IX
Paper         Donut: Document Understanding Transformer without OCR
Model Type    Document Understanding Transformer
Architecture  Swin Transformer + BART

What is donut-base-finetuned-rvlcdip?

Donut (Document Understanding Transformer) reads document images directly, without a separate OCR (Optical Character Recognition) step. This checkpoint is the base Donut model fine-tuned on RVL-CDIP, a document image classification dataset with 16 categories, so it is intended for document image classification tasks.

Implementation Details

The model uses an encoder-decoder architecture: a Swin Transformer vision encoder paired with a BART text decoder. The encoder converts the input document image into a tensor of embeddings of shape (batch_size, seq_len, hidden_size). The BART decoder attends to these embeddings and generates the output text autoregressively, one token at a time. For this checkpoint, the generated text encodes the predicted document class.

  • Vision Encoder: Swin Transformer architecture for image processing
  • Text Decoder: BART-based autoregressive text generation
  • Fine-tuned on the RVL-CDIP dataset
  • OCR-free approach to document understanding
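
To make the (batch_size, seq_len, hidden_size) shape concrete, here is a minimal arithmetic sketch. It assumes a standard Swin layout (patch size 4, four stages, a base embedding width of 128) and an input of 2560 x 1920 pixels. None of these numbers are stated on this page, so treat them as assumptions and check the checkpoint's configuration for the real values. The function name encoder_output_shape is hypothetical.

```python
# Sketch: estimate the encoder output shape for a batch of page images.
# All numeric defaults are ASSUMPTIONS about a Swin-style encoder, not values
# taken from this model card; read the real ones from the checkpoint config.

def encoder_output_shape(batch_size, height, width,
                         patch_size=4, num_stages=4, embed_dim=128):
    """Return (batch_size, seq_len, hidden_size) for a Swin-style encoder."""
    # Patch embedding shrinks each side by patch_size; every stage after the
    # first halves each side again and doubles the channel count.
    downsample = patch_size * 2 ** (num_stages - 1)
    if height % downsample or width % downsample:
        raise ValueError(f"height and width must be multiples of {downsample}")
    seq_len = (height // downsample) * (width // downsample)
    hidden_size = embed_dim * 2 ** (num_stages - 1)
    return batch_size, seq_len, hidden_size


print(encoder_output_shape(1, 2560, 1920))  # (1, 4800, 1024)
```

Under these assumptions a single page becomes 4,800 embedding vectors, and the decoder cross-attends to all of them at every generation step.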

Core Capabilities

  • Document image classification
  • Direct text generation from document images
  • End-to-end document understanding without OCR
  • Efficient processing of complex document layouts
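
The capabilities above can be tried with the Hugging Face transformers library. The sketch below assumes that transformers, torch, and Pillow are installed, that the checkpoint is published as naver-clova-ix/donut-base-finetuned-rvlcdip, and that its task start token is <s_rvlcdip>. The helper names clean_sequence and classify_document are hypothetical, chosen for illustration.

```python
import re

MODEL_ID = "naver-clova-ix/donut-base-finetuned-rvlcdip"  # assumed hub id
TASK_PROMPT = "<s_rvlcdip>"  # assumed task start token for this checkpoint


def clean_sequence(sequence, eos_token="</s>", pad_token="<pad>"):
    """Drop EOS/padding markers and the leading task start token."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    # Remove only the first tag, which is the task prompt.
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


def classify_document(image_path, model_id=MODEL_ID):
    """Run the model on one image and return the parsed prediction."""
    # Heavy dependencies are imported lazily so clean_sequence stays usable
    # without them.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.eval()

    # Encoder input: the page image as a pixel tensor.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values

    # Decoder input: the task prompt that starts generation.
    decoder_input_ids = processor.tokenizer(
        TASK_PROMPT, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = clean_sequence(
        sequence, processor.tokenizer.eos_token, processor.tokenizer.pad_token
    )
    # token2json turns the tag sequence into a dict keyed by field name.
    return processor.token2json(sequence)
```

A call such as classify_document("scan.png"), where scan.png is a placeholder path, downloads the weights on first use and returns a small dictionary holding the predicted class.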

Frequently Asked Questions

Q: What makes this model unique?

The model's key distinction is that it understands documents without an OCR preprocessing step. This removes the cost of running a separate OCR engine and avoids passing OCR errors on to later stages, which can make it more efficient and potentially more accurate for certain document processing tasks. It combines a vision encoder and a language decoder in a single end-to-end transformer.

Q: What are the recommended use cases?

This model is well suited to document image classification, especially for documents that resemble those in the RVL-CDIP dataset. It fits applications that need automated document categorization without OCR preprocessing, for example as a first step that sorts incoming documents ahead of information extraction and other document understanding tasks.
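
As an illustration of automated categorization, the sketch below routes a document based on the decoded model output. It assumes the output has already been decoded to a string in which the label sits inside <s_class> tags, either as a self-closing tag or as plain text. That format is an assumption, and the ROUTES table, queue names, and function names are hypothetical.

```python
import re

# Hypothetical routing table: predicted label -> destination queue.
ROUTES = {"invoice": "finance", "budget": "finance", "resume": "hr"}

# Accepts '<s_class><invoice/></s_class>' or '<s_class>invoice</s_class>'.
_CLASS_PATTERN = re.compile(
    r"<s_class>\s*(?:<([^<>/]+)/>|([^<>]+?))\s*</s_class>"
)


def extract_class(sequence):
    """Return the label found inside the <s_class> tags, or None."""
    match = _CLASS_PATTERN.search(sequence)
    if match is None:
        return None
    return (match.group(1) or match.group(2)).strip()


def route(sequence, routes=ROUTES, default="manual_review"):
    """Map a decoded model output to a queue, with a fallback."""
    return routes.get(extract_class(sequence), default)


print(route("<s_class><invoice/></s_class>"))  # finance
print(route("<s_class><letter/></s_class>"))   # manual_review
```

Sending unknown or unparseable labels to a fallback queue keeps a misclassification from silently dropping a document.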
