dit-doclaynet
| Property | Value |
|---|---|
| Author | jzju |
| Base Model | microsoft/dit-large |
| Training Data | DocLayNet-v1.1 |
| Model Hub | Hugging Face |
What is dit-doclaynet?
dit-doclaynet is a document layout analysis model built on Microsoft's Document Image Transformer (DiT). It is trained for semantic segmentation of document images and labels each pixel as one of 11 document element types, such as captions, footnotes, formulas, and tables.
Implementation Details
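The model uses the BeitForSemanticSegmentation architecture and was trained for 4 epochs on the DocLayNet-v1.1 dataset. It takes preprocessed input images and outputs logits of shape (batch_size, num_labels, height, width), where each of the num_labels channels corresponds to one document element type. Here height and width are those of the logit map, which in the Transformers implementation is not guaranteed to match the input image size.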
- Built on the microsoft/dit-large checkpoint
- Uses AutoImageProcessor for image preprocessing
- Outputs 11-class semantic segmentation maps
- Supports standard document image resolutions
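The details above can be sketched as a single forward pass. This is a minimal sketch, not the author's reference code: the repository id `jzju/dit-doclaynet` is assumed from the author and model name in the table, and the function name `predict_logits` is hypothetical. The heavy imports are deferred into the function so the definition loads even where `torch`, `Pillow`, and `transformers` are not installed.

```python
# Assumed Hub repository id (author + model name); verify before use.
MODEL_ID = "jzju/dit-doclaynet"


def predict_logits(image_path, model_id=MODEL_ID):
    """Run one document image through the model and return the raw logits.

    The returned tensor has shape (1, num_labels, height, width), where
    height and width are those of the logit map.
    """
    # Deferred imports: only needed when the function is actually called.
    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, BeitForSemanticSegmentation

    processor = AutoImageProcessor.from_pretrained(model_id)
    model = BeitForSemanticSegmentation.from_pretrained(model_id)
    model.eval()

    # The model expects RGB input; convert in case the file is grayscale or RGBA.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits
```

Calling `predict_logits("page.png")` (with a hypothetical file name) downloads the weights on first use, so it needs network access and enough memory for a large checkpoint.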
Core Capabilities
- Identifies and segments 11 document elements: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title
- Processes RGB document images
- Generates pixel-wise segmentation masks
- Supports batch processing of documents
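To illustrate how logits become pixel-wise masks, the sketch below takes an array of shape (batch_size, num_labels, height, width) and picks the highest-scoring class per pixel. It uses NumPy on synthetic data rather than real model output. The label order in `LABELS` is an assumption that simply follows the list above; in practice the mapping should be read from the model's configuration (`config.id2label`). The helper names are hypothetical.

```python
import numpy as np

# Assumed class order (taken from the list above); the real index-to-name
# mapping should come from the model's config.id2label.
LABELS = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]


def logits_to_label_map(logits):
    """Convert (batch, num_labels, H, W) logits to (batch, H, W) class ids."""
    return np.argmax(logits, axis=1)


def class_pixel_fractions(label_map, labels=LABELS):
    """Return the fraction of pixels assigned to each class name."""
    counts = np.bincount(label_map.ravel(), minlength=len(labels))
    return {name: float(counts[i]) / label_map.size for i, name in enumerate(labels)}


# Synthetic batch of one 2x2 "image": top row scores highest for Table,
# bottom row scores highest for Text.
logits = np.zeros((1, len(LABELS), 2, 2), dtype=np.float32)
logits[0, LABELS.index("Table"), 0, :] = 5.0
logits[0, LABELS.index("Text"), 1, :] = 5.0

label_map = logits_to_label_map(logits)
fractions = class_pixel_fractions(label_map)
```

Because the argmax runs over the class axis only, the same function handles any batch size without changes.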
Frequently Asked Questions
Q: What makes this model unique?
The model focuses on document layout analysis: it segments pages at the pixel level into 11 element types, which makes it useful for document understanding and processing tasks.
Q: What are the recommended use cases?
The model suits document processing pipelines, academic paper analysis, automated document understanding systems, and other applications that need detailed document structure analysis. It is particularly useful for extracting structured information from complex document layouts.
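As one possible way to get structured information out of a segmentation mask, the sketch below computes a bounding box per class from a 2D label map. It is an illustrative post-processing step, not part of the model: the function name `class_bounding_boxes` is hypothetical, the class order is assumed, and each class gets a single box that spans all of its pixels, so separate regions of the same class are merged.

```python
import numpy as np

# Assumed class order; in practice use the model's config.id2label.
CLASS_NAMES = [
    "Caption", "Footnote", "Formula", "List-item", "Page-footer",
    "Page-header", "Picture", "Section-header", "Table", "Text", "Title",
]


def class_bounding_boxes(label_map, class_names=CLASS_NAMES):
    """Return {name: (x_min, y_min, x_max, y_max)} for classes present.

    label_map is a 2D integer array of class ids for a single page.
    Coordinates are inclusive pixel indices in the label map.
    """
    boxes = {}
    for class_id, name in enumerate(class_names):
        ys, xs = np.nonzero(label_map == class_id)
        if ys.size == 0:
            continue  # class does not occur on this page
        boxes[name] = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return boxes


# Synthetic 4x6 page: mostly Text, with a Table block in the middle.
page = np.full((4, 6), CLASS_NAMES.index("Text"))
page[1:3, 2:5] = CLASS_NAMES.index("Table")

boxes = class_bounding_boxes(page)
```

Splitting a class into individual regions would need a connected-component step, which this sketch leaves out.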