donut-base-finetuned-invoices

Property	Value
License	cc-by-nc-sa-4.0
Research Paper	OCR-free Document Understanding Transformer
Input Resolution	1280x1920 pixels
Training Duration	4 hours (20k steps)

What is donut-base-finetuned-invoices?

This model is a version of the Donut (Document Understanding Transformer) architecture fine-tuned to process invoices in multiple languages. It pairs a Swin Transformer vision encoder with a BART text decoder: the encoder reads the page image and the decoder generates the extracted fields directly, so no separate OCR step is needed.

Implementation Details

The model was trained on a proprietary dataset of thousands of annotated invoices and non-invoices using a single NVIDIA RTX A4000 GPU. It processes single-page documents at an input resolution of 1280x1920 pixels, which suits scans made at 150 DPI or lower.
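To make the resolution figures concrete: an A4 page scanned at 150 DPI is roughly 1240x1754 pixels, which already fits inside 1280x1920, while a 300 DPI scan has to be scaled down. The following is a minimal sketch of that arithmetic, not the model's actual preprocessing. The helper name `fit_within` is hypothetical, and reading 1280 as the width and 1920 as the height (portrait orientation) is an assumption.

```python
def fit_within(width, height, max_width=1280, max_height=1920):
    """Scale (width, height) down to fit the input size, keeping aspect ratio.

    Assumption: 1280 is the width and 1920 the height (portrait page).
    Pages that already fit are returned unchanged (no upscaling).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)


# A4 at 150 DPI (about 1240x1754 px) already fits the input size.
print(fit_within(1240, 1754))

# A4 at 300 DPI (about 2480x3508 px) is limited by its width.
print(fit_within(2480, 3508))
```

In practice the image processor shipped with a model checkpoint would normally handle resizing and padding; the sketch only shows why scans above 150 DPI carry more pixels than the model can use.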

Trained for 20,000 steps (about 4 hours), reaching a final validation metric of 0.034
Supports extraction of key fields including DocType, Currency, DocumentDate, GrossAmount, InvoiceNumber, NetAmount, TaxAmount, OrderNumber, and CreditorCountry
Implements a vision-encoder-decoder architecture for end-to-end document understanding
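The field list above can be illustrated with a small post-processing sketch. Donut-style decoders usually emit fields as tagged spans such as `<s_Currency>EUR</s_Currency>`; that this model uses exactly this tag format is an assumption here, and the sample string below is invented for illustration. The function `parse_fields` is a hypothetical helper that handles flat fields only. For the general case (nested or repeated fields), the Transformers library's `DonutProcessor` offers a `token2json` method.

```python
import re

# Field names taken from the model description above.
FIELDS = (
    "DocType", "Currency", "DocumentDate", "GrossAmount", "InvoiceNumber",
    "NetAmount", "TaxAmount", "OrderNumber", "CreditorCountry",
)

# Assumed output convention: flat Donut-style <s_Name>value</s_Name> spans.
FIELD_PATTERN = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def parse_fields(sequence):
    """Map each known field name to its extracted text, or None if absent."""
    found = {name: value.strip() for name, value in FIELD_PATTERN.findall(sequence)}
    return {name: found.get(name) for name in FIELDS}


# Invented sample output, for illustration only.
sample = (
    "<s_DocType>Invoice</s_DocType>"
    "<s_Currency>EUR</s_Currency>"
    "<s_GrossAmount>119.00</s_GrossAmount>"
)
fields = parse_fields(sample)
print(fields["DocType"], fields["Currency"], fields["GrossAmount"])
```

Returning `None` for missing fields keeps the result shape fixed, which makes it easier for downstream code to tell "not found on this page" apart from an empty string.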

Core Capabilities

Multilingual invoice processing
OCR-free document understanding
Automatic field extraction and classification
Document type identification (Invoice vs Other)

Frequently Asked Questions

Q: What makes this model unique?

The model reads invoices without a separate OCR stage. A single pass through its transformer encoder-decoder maps the page image directly to structured fields, and the same model handles multiple languages and varied invoice layouts.

Q: What are the recommended use cases?

The model suits automated invoice processing pipelines, financial document analysis, and research on multilingual invoice understanding, within the non-commercial terms of its cc-by-nc-sa-4.0 license. It is particularly useful where invoices arrive from several countries and key fields must be extracted automatically.