mgp-str-base

Maintained By
alibaba-damo

MGP-STR Base Model

Property          Value
Parameter Count   148M
Model Type        Vision Transformer (ViT)
Training Data     MJSynth and SynthText
Paper             Multi-Granularity Prediction for Scene Text Recognition
Tensor Type       F32

What is mgp-str-base?

MGP-STR base is a scene text recognition (STR) model developed by Alibaba DAMO Academy. It pairs a Vision Transformer (ViT) backbone with specially designed A^3 (Adaptive Addressing and Aggregation) modules. The model does not predict characters alone. It makes predictions at multiple granularities (characters and subwords) and fuses them to improve recognition accuracy.

Implementation Details

Input images are resized to 32x128 pixels and split into non-overlapping 4x4 patches, which are linearly embedded. The ViT backbone is initialized from DeiT-base weights. The exception is the patch embedding, which differs because of the different input size. The A^3 modules then select meaningful combinations of ViT output tokens and aggregate each combination into a single output token that predicts a specific character or subword.
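The patch arithmetic works out as follows. A 32x128 input cut into 4x4 patches gives 32/4 = 8 rows and 128/4 = 32 columns, so the ViT sees 256 patch tokens. The NumPy sketch below reproduces this, assuming a 3-channel (RGB) input, so each patch flattens to 4*4*3 = 48 values. It then adds a linear embedding and learned absolute position embeddings, assuming a hidden size of 768 (the DeiT-base width). The weights are random stand-ins, and the sketch omits any class token the real implementation may prepend.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch * C), ordered row by row.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.random((32, 128, 3), dtype=np.float32)  # stand-in for a resized RGB crop
patches = patchify(image, patch=4)
print(patches.shape)  # (256, 48)

# Hypothetical linear patch embedding plus learned absolute position embeddings.
# Random weights here; in the real model these are trained parameters.
hidden = 768  # assumed hidden size (DeiT-base width)
w_embed = rng.standard_normal((patches.shape[1], hidden)).astype(np.float32) * 0.02
pos_embed = rng.standard_normal((patches.shape[0], hidden)).astype(np.float32) * 0.02
tokens = patches @ w_embed + pos_embed  # one position vector added per patch
print(tokens.shape)  # (256, 768)
```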

  • Implements multi-granularity prediction combining character, subword, and even word-level recognition
  • Uses BPE and WordPiece A^3 modules for subword predictions, so that language information is modeled implicitly
  • Adds absolute position embeddings to the patch embeddings before the ViT layers
  • Merges the character, BPE, and WordPiece predictions with a simple fusion strategy
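The token selection performed by an A^3 module can be illustrated with a simplified, single-image NumPy sketch. It assumes the module computes one softmax-normalized attention map over the N ViT tokens for each of T output slots. Each slot then takes a weighted sum of projected tokens and feeds it to a classification head. The sketch omits details of the real module such as normalization layers and its exact projection structure. The sizes (T = 27 slots, 38 character classes) and the random weights are hypothetical.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def a3_module(tokens: np.ndarray, w_attn: np.ndarray, w_feat: np.ndarray):
    """Simplified A^3-style aggregation (hypothetical parameterization).

    tokens: (N, D) ViT output tokens
    w_attn: (D, T) projection giving one attention logit per output slot
    w_feat: (D, D) projection applied to tokens before aggregation
    Returns the (T, D) aggregated slot tokens and the (T, N) attention maps.
    """
    attn = softmax((tokens @ w_attn).T, axis=-1)  # (T, N): each slot weights all N tokens
    feats = tokens @ w_feat                       # (N, D)
    return attn @ feats, attn                     # weighted sum of tokens per slot

rng = np.random.default_rng(0)
N, D, T = 256, 768, 27  # hypothetical: 256 patch tokens, width 768, 27 output slots
tokens = rng.standard_normal((N, D))
slots, attn = a3_module(tokens,
                        rng.standard_normal((D, T)) * 0.02,
                        rng.standard_normal((D, D)) * 0.02)
print(slots.shape, attn.shape)  # (27, 768) (27, 256)

# Hypothetical character head: each slot is classified into one character class.
num_classes = 38
w_cls = rng.standard_normal((D, num_classes)) * 0.02
char_ids = (slots @ w_cls).argmax(axis=-1)  # one predicted class id per slot
```

The same pattern would apply to the BPE and WordPiece heads. Only the number of slots and the output vocabulary would change.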

Core Capabilities

  • Scene text recognition on cropped word images
  • Recognition of text across varied styles and orientations
  • Fixed 32x128 pixel input resolution
  • Implicit language modeling through BPE and WordPiece subword classification, without a separate language model

Frequently Asked Questions

Q: What makes this model unique?

Its multi-granularity prediction. Besides per-character predictions, specialized A^3 modules produce BPE and WordPiece subword predictions. The outputs are then fused into one final result. The subword heads let a purely visual model capture language information implicitly, which helps when individual characters are ambiguous.
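One way to realize the fusion step is to keep the prediction from whichever head is most confident. Here a sequence's confidence is taken as the product of its per-token probabilities. The sketch below assumes that rule. The head names, decoded strings, and probabilities are made up for illustration, and real BPE or WordPiece outputs would first need detokenizing into plain text.

```python
from math import prod

def sequence_confidence(token_probs):
    """Confidence of a decoded sequence: product of its per-token max probabilities."""
    return prod(token_probs)

def fuse(candidates):
    """Pick the decoded string from the most confident head.

    candidates maps a head name to (decoded_text, per-token probabilities).
    Returns (head, text, confidence).
    """
    scored = {head: (text, sequence_confidence(p)) for head, (text, p) in candidates.items()}
    best_head = max(scored, key=lambda h: scored[h][1])
    return best_head, *scored[best_head]

# Hypothetical outputs of the three heads for a single image.
candidates = {
    "char": ("ticket", [0.99, 0.97, 0.60, 0.95, 0.98, 0.99]),  # one uncertain character
    "bpe": ("ticket", [0.90, 0.97]),                            # two subword tokens
    "wordpiece": ("tickel", [0.80]),                            # one whole-word token
}
head, text, score = fuse(candidates)
print(head, text, round(score, 4))  # bpe ticket 0.873
```

A longer sequence multiplies more probabilities, so the character head is penalized here for its one uncertain position. Coarser heads that cover the word in fewer tokens can win in such cases.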

Q: What are the recommended use cases?

The model is intended for scene text recognition, that is, reading text from cropped regions of natural-scene images. It recognizes one cropped text region at a time. Full-image OCR applications such as document processing or automated data extraction therefore need a separate text detector that locates and crops text regions before passing them to the model.
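The detect-then-recognize split can be sketched as a small driver. It assumes a detector has already produced word bounding boxes as (x0, y0, x1, y1) pixel coordinates, and it hands each crop to a recognizer callable. The `fake_recognize` stand-in and the box values are hypothetical. A real recognizer would wrap the model, for example through the `MgpstrProcessor` and `MgpstrForSceneTextRecognition` classes in Hugging Face `transformers`.

```python
from typing import Callable, List, Tuple

import numpy as np

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels, end-exclusive

def read_text_regions(image: np.ndarray, boxes: List[Box],
                      recognize: Callable[[np.ndarray], str]) -> List[str]:
    """Crop each detected word box and pass the crop to a recognizer."""
    results = []
    for x0, y0, x1, y1 in boxes:
        crop = image[y0:y1, x0:x1]
        if crop.size == 0:
            results.append("")  # degenerate box: nothing to recognize
            continue
        results.append(recognize(crop))
    return results

def fake_recognize(crop: np.ndarray) -> str:
    """Hypothetical stand-in recognizer that reports the crop's width x height."""
    return f"{crop.shape[1]}x{crop.shape[0]}"

image = np.zeros((100, 300, 3), dtype=np.uint8)
boxes = [(10, 20, 138, 52), (150, 60, 200, 60)]  # second box has zero height
print(read_text_regions(image, boxes, fake_recognize))  # ['128x32', '']
```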
