TB-OCR-preview-0.1
| Property | Value |
|---|---|
| Parameter Count | 4.25B |
| Model Type | OCR / Image-Text-to-Text |
| License | MIT |
| Base Model | microsoft/Phi-3.5-vision-instruct |
| Memory Requirement | 2.8 GB VRAM (4-bit quantized) |
What is TB-OCR-preview-0.1?
TB-OCR-preview-0.1 is an end-to-end OCR model developed by Yifei Hu that handles plain text, mathematical LaTeX expressions, and markdown formatting in a single pass. Trained on approximately 250k image-text pairs (~50M tokens), this preview model removes the need for separate line-detection or math-formula-detection stages.
Implementation Details
Built on Microsoft's Phi-3.5-vision-instruct architecture, TB-OCR is a vision-language transformer that can be run efficiently with 4-bit quantization, requiring only about 2.8 GB of VRAM. The model takes an image of a single text block as input and generates clean markdown output, with special handling for headers and mathematical expressions; a minimal loading and inference sketch follows the feature list below.
- Efficient 4-bit quantization support
- Flash Attention 2 implementation
- BF16 tensor type optimization
- Markdown-formatted output generation
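For illustration, here is a minimal loading and inference sketch written in the style of the Phi-3.5-vision examples. The repository id `yifeihu/TB-OCR-preview-0.1`, the prompt wording, the generation settings, and the `ocr_block` helper name are assumptions not confirmed by this page; consult the model repository for the recommended snippet.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "yifeihu/TB-OCR-preview-0.1"  # assumed repository id

# 4-bit NF4 quantization with BF16 compute keeps the model around 2.8 GB of VRAM.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,                    # Phi-3.5-vision ships custom modeling code
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",   # Flash Attention 2, as listed above
    quantization_config=quant_config,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def ocr_block(image: Image.Image, max_new_tokens: int = 1024) -> str:
    """OCR a single cropped text block into markdown (prompt wording is assumed)."""
    question = "Convert the text to markdown format."
    prompt = f"<|user|>\n<|image_1|>\n{question}<|end|>\n<|assistant|>\n"
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the generated continuation.
    output_ids = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

# markdown_text = ocr_block(Image.open("text_block.png"))
```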
Core Capabilities
- Automatic header detection and markdown formatting (##)
- Mathematical expression handling with proper LaTeX bracketing (see the normalization sketch after this list)
- Integrated text and formula recognition
- Efficient processing of text blocks
- Support for batch inference and parallel processing
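As a small post-processing illustration, the sketch below normalizes one block's output for a standard Markdown plus MathJax toolchain. It assumes (not confirmed on this page) that the model emits `##` headers and wraps inline math in `\( ... \)` and display math in `\[ ... \]`; the helper name is hypothetical.

```python
import re

def normalize_block_output(md: str) -> str:
    r"""Normalize one TB-OCR block result for a Markdown/MathJax pipeline.

    Assumes the model marks headers with '##' and brackets math with
    \( ... \) (inline) and \[ ... \] (display); both are assumptions.
    """
    # Convert display math \[ ... \] to $$ ... $$ for common Markdown renderers.
    md = re.sub(r"\\\[(.+?)\\\]", r"$$\1$$", md, flags=re.DOTALL)
    # Convert inline math \( ... \) to $ ... $.
    md = re.sub(r"\\\((.+?)\\\)", r"$\1$", md, flags=re.DOTALL)
    # Collapse stray blank lines and trim surrounding whitespace.
    md = re.sub(r"\n{3,}", "\n\n", md).strip()
    return md

print(normalize_block_output("## Theorem\n\nLet \\( x \\) satisfy \\[ x^2 = 2. \\]"))
```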
Frequently Asked Questions
Q: What makes this model unique?
TB-OCR stands out for its ability to handle plain text, LaTeX math, and markdown formatting in a single pass, eliminating the need for separate detection steps. Its 4-bit quantization support allows high-performance OCR with minimal VRAM requirements.
Q: What are the recommended use cases?
The model is specifically designed to process individual text blocks rather than full pages. For full-page OCR, it is recommended to pair it with TFT-ID-1.0 for text/table/figure detection and to OCR the resulting text blocks in parallel for better throughput.
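A hypothetical sketch of that two-stage pipeline is shown below. It assumes TFT-ID-1.0 is published as `yifeihu/TFT-ID-1.0` and follows the standard Florence-2 object-detection interface (`<OD>` task prompt, boxes returned by `post_process_generation`), that detected regions are labeled `text`/`table`/`figure`, and it reuses the `ocr_block` helper from the loading sketch above; none of these specifics are stated on this page.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Stage 1: text/table/figure block detection (TFT-ID-1.0 is Florence-2 based).
det_id = "yifeihu/TFT-ID-1.0"  # assumed repository id
det_model = AutoModelForCausalLM.from_pretrained(det_id, trust_remote_code=True).to("cuda")
det_processor = AutoProcessor.from_pretrained(det_id, trust_remote_code=True)

def detect_blocks(page: Image.Image):
    """Return (label, bbox) pairs for regions detected on a full page."""
    inputs = det_processor(text="<OD>", images=page, return_tensors="pt").to("cuda")
    ids = det_model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
    )
    raw = det_processor.batch_decode(ids, skip_special_tokens=False)[0]
    parsed = det_processor.post_process_generation(
        raw, task="<OD>", image_size=(page.width, page.height)
    )
    return list(zip(parsed["<OD>"]["labels"], parsed["<OD>"]["bboxes"]))

# Stage 2: crop each text block and OCR it with TB-OCR. The crops are
# independent, so in practice they can be sent to several workers or an
# inference server in parallel instead of this sequential loop.
def ocr_page(page: Image.Image) -> str:
    blocks = []
    for label, (x1, y1, x2, y2) in detect_blocks(page):
        if label != "text":  # label name is assumed; tables/figures need separate handling
            continue
        crop = page.crop((int(x1), int(y1), int(x2), int(y2)))
        blocks.append(ocr_block(crop))  # ocr_block: see the loading sketch above
    return "\n\n".join(blocks)

# page_markdown = ocr_page(Image.open("page_01.png").convert("RGB"))
```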