SmolDocling-256M-preview

Property	Value
Developer	Docling Team, IBM Research
Model Type	Multimodal Image-Text-to-Text
License	Apache 2.0
Base Model	Idefics3
Paper	arXiv:2503.11576

What is SmolDocling-256M-preview?

SmolDocling-256M-preview is a cutting-edge multimodal document processing model designed for efficient document conversion. It introduces DocTags, a minimal yet powerful representation system that enables seamless document processing while maintaining compatibility with DoclingDocuments. The model achieves impressive performance with an average inference time of 0.35 seconds per page on an A100 GPU.

Implementation Details

The model implements a sophisticated architecture that combines vision and language processing capabilities. It utilizes the VLLM framework for fast inference and supports both Transformers and VLLM implementations. The model introduces DocTags as an efficient tokenization system that separates text from document structure, making it particularly effective for Image-to-Sequence tasks.

Flash Attention 2 support for CUDA devices
Batch processing capabilities for multiple documents
Seamless integration with Docling ecosystem
Support for multiple output formats including HTML and Markdown

Core Capabilities

Advanced OCR with bounding box support
Precise layout analysis and element localization
Code block recognition with indentation preservation
Mathematical formula detection and processing
Chart and table recognition with header support
Figure classification and caption correspondence
List structure preservation and grouping
Full-page conversion with comprehensive element processing

Frequently Asked Questions

Q: What makes this model unique?

SmolDocling's unique strength lies in its efficient DocTags system and comprehensive document element processing capabilities, while maintaining a relatively small model size of 256M parameters. The model's ability to handle multiple document elements (code, equations, tables, charts) in a single pass makes it particularly valuable for document conversion tasks.

Q: What are the recommended use cases?

The model is ideal for converting both scientific and non-scientific documents to structured formats. It excels in scenarios requiring accurate preservation of document structure, including academic papers, technical documentation, and business documents where maintaining layout and formatting is crucial.