SmolDocling-256M-preview

Maintained By
ds4sd

SmolDocling-256M-preview

PropertyValue
DeveloperDocling Team, IBM Research
Model TypeMultimodal Image-Text-to-Text
LicenseApache 2.0
Base ModelIdefics3
PaperarXiv:2503.11576

What is SmolDocling-256M-preview?

SmolDocling-256M-preview is a cutting-edge multimodal document processing model designed for efficient document conversion. It introduces DocTags, a minimal yet powerful representation system that enables seamless document processing while maintaining compatibility with DoclingDocuments. The model achieves impressive performance with an average inference time of 0.35 seconds per page on an A100 GPU.

Implementation Details

The model implements a sophisticated architecture that combines vision and language processing capabilities. It utilizes the VLLM framework for fast inference and supports both Transformers and VLLM implementations. The model introduces DocTags as an efficient tokenization system that separates text from document structure, making it particularly effective for Image-to-Sequence tasks.

  • Flash Attention 2 support for CUDA devices
  • Batch processing capabilities for multiple documents
  • Seamless integration with Docling ecosystem
  • Support for multiple output formats including HTML and Markdown

Core Capabilities

  • Advanced OCR with bounding box support
  • Precise layout analysis and element localization
  • Code block recognition with indentation preservation
  • Mathematical formula detection and processing
  • Chart and table recognition with header support
  • Figure classification and caption correspondence
  • List structure preservation and grouping
  • Full-page conversion with comprehensive element processing

Frequently Asked Questions

Q: What makes this model unique?

SmolDocling's unique strength lies in its efficient DocTags system and comprehensive document element processing capabilities, while maintaining a relatively small model size of 256M parameters. The model's ability to handle multiple document elements (code, equations, tables, charts) in a single pass makes it particularly valuable for document conversion tasks.

Q: What are the recommended use cases?

The model is ideal for converting both scientific and non-scientific documents to structured formats. It excels in scenarios requiring accurate preservation of document structure, including academic papers, technical documentation, and business documents where maintaining layout and formatting is crucial.

🍰 Interesting in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.