MGP-STR Base Model
| Property | Value |
|---|---|
| Parameter Count | 148M |
| Model Type | Vision Transformer (ViT) |
| Training Data | MJSynth and SynthText |
| Paper | Multi-Granularity Prediction for Scene Text Recognition |
| Tensor Type | F32 |
What is mgp-str-base?
MGP-STR base is a scene text recognition model that combines a Vision Transformer (ViT) backbone with specially designed A^3 modules. Developed by Alibaba DAMO Academy, it improves on conventional optical character recognition by predicting text at multiple granularity levels (character, subword, and word) rather than at the character level alone, which yields higher recognition accuracy.
Implementation Details
The model processes images of size 32x128 by dividing them into 4x4 patches. It utilizes a ViT architecture initialized from DeiT-base weights, with custom modifications for text recognition tasks. The unique A^3 modules select and combine meaningful tokens from the ViT output to predict characters and subwords.
- Implements multi-granularity prediction combining character, subword, and word-level recognition
- Uses BPE and WordPiece A^3 modules for subword predictions
- Incorporates absolute position embeddings for spatial awareness
- Features an effective fusion strategy for prediction integration
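For reference, here is a minimal inference sketch using the Hugging Face transformers integration of MGP-STR. It assumes the `alibaba-damo/mgp-str-base` checkpoint together with the `MgpstrProcessor` and `MgpstrForSceneTextRecognition` classes from that integration; the input file name is a placeholder for any cropped word image.

```python
from PIL import Image
from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition

# Load the processor (image preprocessing + tokenizers) and the model weights.
processor = MgpstrProcessor.from_pretrained("alibaba-damo/mgp-str-base")
model = MgpstrForSceneTextRecognition.from_pretrained("alibaba-damo/mgp-str-base")

# Placeholder path: any image containing a cropped word.
image = Image.open("word_crop.png").convert("RGB")

# Resize/normalize to the expected 32x128 input and run inference.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
outputs = model(pixel_values)

# outputs.logits is a tuple of character-, BPE-, and WordPiece-level logits;
# batch_decode fuses them into the final text prediction.
generated_text = processor.batch_decode(outputs.logits)["generated_text"]
print(generated_text)
```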
Core Capabilities
- High-accuracy scene text recognition
- Robust handling of various text styles and orientations
- Efficient processing of 32x128 pixel images
- Advanced language modeling through subword classification
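As a rough illustration of the 32x128 input and 4x4 patch split described above, the sketch below (plain PyTorch, with a dummy tensor standing in for a real image) counts the patch tokens the ViT backbone operates on.

```python
import torch

image_height, image_width = 32, 128   # input resolution expected by MGP-STR base
patch_size = 4                        # each patch covers a 4x4 pixel region

# Dummy batch of one RGB image in (batch, channels, height, width) layout.
pixel_values = torch.randn(1, 3, image_height, image_width)

# Split the image into non-overlapping 4x4 patches.
patches = pixel_values.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# patches has shape (1, 3, 8, 32, 4, 4): an 8 x 32 grid of patches.
num_patches = patches.shape[2] * patches.shape[3]
print(num_patches)  # 256 patch tokens per image (the ViT adds its own special token on top)
```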
Frequently Asked Questions
Q: What makes this model unique?
The model's unique strength lies in its multi-granularity prediction approach, combining character, subword, and word-level recognition through specialized A^3 modules. This allows for more robust and accurate text recognition across various scenarios.
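As a simplified sketch of how such a fusion can work (a stand-in for the paper's exact strategy, not a reproduction of it), the snippet below scores the character-, BPE-, and WordPiece-level logits by the mean confidence of their greedy predictions and keeps the most confident head.

```python
import torch

def pick_most_confident_head(logits_tuple):
    """Illustrative fusion: score each granularity head by the mean probability
    of its greedily chosen tokens and return the index of the most confident
    head (0 = character, 1 = BPE, 2 = WordPiece). This is a simplified
    stand-in for MGP-STR's fusion strategy, not the exact rule from the paper."""
    best_head, best_score = 0, float("-inf")
    for head_idx, logits in enumerate(logits_tuple):
        probs = logits.softmax(dim=-1)        # (batch, seq_len, vocab)
        token_conf, _ = probs.max(dim=-1)     # confidence of each greedy token
        score = token_conf.mean().item()      # average confidence for this head
        if score > best_score:
            best_head, best_score = head_idx, score
    return best_head
```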
Q: What are the recommended use cases?
The model is ideal for optical character recognition (OCR) tasks, particularly in scenarios involving scene text recognition. It's well-suited for applications in document processing, automated data extraction, and real-world text recognition systems.