Donut-base-finetuned-invoices
| Property | Value |
|---|---|
| License | cc-by-nc-sa-4.0 |
| Research Paper | OCR-free Document Understanding Transformer |
| Input Resolution | 1280x1920 pixels |
| Training Duration | 4 hours (20k steps) |
What is donut-base-finetuned-invoices?
This model is a Donut (Document Understanding Transformer) checkpoint fine-tuned to process and understand invoices in multiple languages. It pairs a Swin Transformer vision encoder with a BART text decoder and extracts key information directly from the invoice image, without a separate OCR step.
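A minimal inference sketch with the Hugging Face `transformers` Donut classes may help make the encoder-decoder flow concrete. The hub identifier `to-be/donut-base-finetuned-invoices` and the task prompt `<s_cord-v2>` are assumptions not stated in this card, and `extract_invoice_fields` and `clean_sequence` are hypothetical helper names; verify the identifier and prompt against the published checkpoint before relying on them.

```python
import re


def clean_sequence(sequence: str, eos_token: str, pad_token: str) -> str:
    """Strip EOS/PAD tokens and the leading task-prompt tag from decoder output."""
    sequence = sequence.replace(eos_token, "").replace(pad_token, "")
    # Remove only the first tag, which is the task prompt that started decoding.
    return re.sub(r"<.*?>", "", sequence, count=1).strip()


def extract_invoice_fields(
    image_path: str,
    model_id: str = "to-be/donut-base-finetuned-invoices",  # assumed hub ID
    task_prompt: str = "<s_cord-v2>",  # assumed task prompt
) -> dict:
    """Run one page image through the vision encoder and text decoder."""
    # Heavy dependencies are imported lazily so clean_sequence works without them.
    import torch
    from PIL import Image
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    processor = DonutProcessor.from_pretrained(model_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)
    model.eval()

    # The Swin encoder consumes pixel values; the BART decoder starts from the prompt.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(image, return_tensors="pt").pixel_values
    decoder_input_ids = processor.tokenizer(
        task_prompt, add_special_tokens=False, return_tensors="pt"
    ).input_ids

    with torch.no_grad():
        outputs = model.generate(
            pixel_values,
            decoder_input_ids=decoder_input_ids,
            max_length=model.decoder.config.max_position_embeddings,
            pad_token_id=processor.tokenizer.pad_token_id,
            eos_token_id=processor.tokenizer.eos_token_id,
            use_cache=True,
            bad_words_ids=[[processor.tokenizer.unk_token_id]],
            return_dict_in_generate=True,
        )

    sequence = processor.batch_decode(outputs.sequences)[0]
    sequence = clean_sequence(
        sequence, processor.tokenizer.eos_token, processor.tokenizer.pad_token
    )
    # token2json converts the tagged token sequence into a nested dict.
    return processor.token2json(sequence)
```

Calling `extract_invoice_fields("invoice.png")` (a hypothetical file name) would download the checkpoint on first use and return a dictionary of extracted fields.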
Implementation Details
The model was trained on a proprietary dataset of thousands of annotated invoices and non-invoices using an NVIDIA RTX A4000 GPU. It processes single-page documents at a resolution of 1280x1920 pixels and is optimized for scans of 150 DPI or lower.
- Trained for 20,000 steps with a final validation metric of 0.034
- Supports extraction of key fields including DocType, Currency, DocumentDate, GrossAmount, InvoiceNumber, NetAmount, TaxAmount, OrderNumber, and CreditorCountry
- Implements a vision-encoder-decoder architecture for end-to-end document understanding
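The resolution and DPI figures above can be checked with a little arithmetic. The sketch below assumes that 1280x1920 means 1280 pixels wide by 1920 pixels high (portrait) and that oversized scans are scaled down with their aspect ratio preserved; the card states neither, and `mm_to_pixels` and `fit_within` are hypothetical helpers, not part of the model's own preprocessing.

```python
from typing import Tuple


def mm_to_pixels(length_mm: float, dpi: int) -> int:
    """Convert a physical length in millimetres to pixels at the given DPI."""
    return round(length_mm / 25.4 * dpi)


def fit_within(
    width: int, height: int, max_width: int = 1280, max_height: int = 1920
) -> Tuple[int, int]:
    """Scale (width, height) down to fit the input size, never scaling up."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


# An A4 page is 210 mm x 297 mm.
a4_150 = (mm_to_pixels(210, 150), mm_to_pixels(297, 150))  # (1240, 1754)
a4_300 = (mm_to_pixels(210, 300), mm_to_pixels(297, 300))  # (2480, 3508)

print(fit_within(*a4_150))  # (1240, 1754): already fits, left unchanged
print(fit_within(*a4_300))  # (1280, 1811): must be downscaled
```

Under these assumptions an A4 page scanned at 150 DPI fits inside the input resolution without resizing, while a 300 DPI scan loses roughly half its linear resolution, which is consistent with the 150 DPI guidance.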
Core Capabilities
- Multilingual invoice processing
- OCR-free document understanding
- Automatic field extraction and classification
- Document type identification (Invoice vs Other)
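Field extraction and document type identification can be illustrated with a small parser for Donut-style tagged output. This is a sketch that assumes the decoder emits flat `<s_Field>value</s_Field>` pairs and that `DocType` takes the values `Invoice` or `Other`; the sample sequence is invented for illustration, and `parse_flat_fields` and `is_invoice` are hypothetical helpers rather than part of any library.

```python
import re
from typing import Dict

# Matches one flat <s_Key>value</s_Key> pair; the closing tag must repeat the key.
FIELD_PATTERN = re.compile(r"<s_(?P<key>\w+)>(?P<value>.*?)</s_(?P=key)>", re.DOTALL)


def parse_flat_fields(sequence: str) -> Dict[str, str]:
    """Collect flat field tags from a decoded sequence into a dictionary."""
    return {
        match.group("key"): match.group("value").strip()
        for match in FIELD_PATTERN.finditer(sequence)
    }


def is_invoice(fields: Dict[str, str]) -> bool:
    """Return True when the extracted DocType marks the page as an invoice."""
    return fields.get("DocType", "").strip().lower() == "invoice"


# Invented sample output, for illustration only.
sample = (
    "<s_DocType>Invoice</s_DocType>"
    "<s_Currency>EUR</s_Currency>"
    "<s_GrossAmount>119.00</s_GrossAmount>"
)
fields = parse_flat_fields(sample)
print(fields["Currency"], is_invoice(fields))
```

A downstream pipeline could use `is_invoice` to route non-invoice pages away from field validation. For nested or repeated tags, the `token2json` method of `DonutProcessor` in `transformers` is the more robust choice.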
Frequently Asked Questions
Q: What makes this model unique?
The model processes invoices without a separate OCR stage. Its transformer-based architecture maps the page image to structured fields in a single pass and handles multiple languages and a variety of invoice formats.
Q: What are the recommended use cases?
The model suits automated invoice processing systems, financial document analysis, and research applications that require multilingual invoice understanding. It is particularly useful for organizations that handle international invoices and need automated data extraction.