TokenOCR
| Property | Value |
|---|---|
| Author | TongkunGuan |
| License | MIT |
| Model Hub | Hugging Face |
| Dataset Size | 20M images, 1.8B token-mask pairs |
What is TokenOCR?
TokenOCR is described as the first token-level visual foundation model designed for text-image tasks and document understanding. It is trained on TokenIT, a dataset of 20 million images and 1.8 billion token-mask pairs, and it processes text within images at the level of individual tokens rather than whole images.
Implementation Details
The model comes in three variants. The recommended one is TokenOCR-4096-English-seg, which uses a ViT backbone and a 4096-dimensional feature space. The architecture aligns token-level image features with language features in the same semantic space, so a text query can be compared directly against image regions in document understanding tasks.
- Trained on the TokenIT dataset of 20 million images and 1.8 billion token-mask pairs
- Supports English and Chinese text interaction in the bilingual version
- Aligns visual and language features at the token level for precise text understanding
- Uses a two-stage training process: LLM-guided Token Alignment and Supervised Instruction Tuning
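To make the idea of a shared semantic space concrete, the following is a minimal sketch of comparing one text-token embedding against per-patch image features with cosine similarity. It is an illustration under stated assumptions, not the model's actual code: the function names `cosine` and `similarity_map`, the toy 3-dimensional vectors, and the 2x2 patch grid are all hypothetical, and the real model would supply 4096-dimensional features from its ViT backbone.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_map(token_embedding, patch_features):
    """Score every image patch against one text-token embedding.

    patch_features is a grid (rows x cols) of feature vectors, standing in
    for the per-patch output of a vision backbone (hypothetical toy values).
    """
    return [[cosine(token_embedding, patch) for patch in row]
            for row in patch_features]

# Toy 2x2 grid with 3-dimensional features (assumption: real features are 4096-d).
patches = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
]
token = [1.0, 0.0, 0.0]  # hypothetical embedding of a query token

sim = similarity_map(token, patches)
for row in sim:
    print([round(score, 3) for score in row])
```

Patches whose features point in the same direction as the token embedding score close to 1, and unrelated patches score close to 0. Because both kinds of features live in one space, no extra projection step is needed before the comparison.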
Core Capabilities
- Text Retrieval: token-level text search within images
- Image Segmentation: identification of text regions in an image
- Visual Question Answering: question answering and reasoning over documents
- Document Understanding: extended capabilities through TokenVL integration
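As a rough illustration of how retrieval and segmentation could both be derived from token-to-patch similarity scores, the sketch below thresholds a score grid into a mask and ranks images by their best-matching patch. Everything here is an assumption made for the example: the helper names `segment` and `rank_images`, the file names, the score values, and the 0.5 threshold are hypothetical and do not come from the TokenOCR release.

```python
def segment(sim_map, threshold=0.5):
    """Binarize a similarity map into a text-region mask (1 = matches the token).

    The 0.5 threshold is an arbitrary choice for illustration.
    """
    return [[1 if score >= threshold else 0 for score in row] for row in sim_map]

def rank_images(sim_maps):
    """Rank image ids by their best-matching patch score, highest first."""
    best = {image_id: max(max(row) for row in sim_map)
            for image_id, sim_map in sim_maps.items()}
    return sorted(best, key=best.get, reverse=True)

# Hypothetical similarity maps for one query token against two images.
sim_maps = {
    "invoice.png": [[0.95, 0.10], [0.80, 0.05]],
    "photo.png": [[0.20, 0.15], [0.10, 0.30]],
}

mask = segment(sim_maps["invoice.png"])
ranking = rank_images(sim_maps)
print(mask)
print(ranking)
```

Using the maximum patch score for ranking favors images that contain the query token anywhere, which suits search. Averaging over patches would instead favor images dominated by the queried text.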
Frequently Asked Questions
Q: What makes this model unique?
TokenOCR processes text at the token level, which offers more precise text understanding than traditional image-level or pixel-level models. It is presented as the first model to align visual and textual features at the token level, which supports more accurate document understanding.
Q: What are the recommended use cases?
The model is well suited to document understanding, text retrieval in images, text segmentation, and visual question answering. For English text, the English-specific version (TokenOCR-4096-English-seg) is recommended.