mgp-str-base

alibaba-damo

MGP-STR base is a 148M-parameter, Vision Transformer-based model for scene text recognition that uses multi-granularity prediction and A^3 modules.

Property         Value
Parameter Count  148M
Model Type       Vision Transformer (ViT)
Training Data    MJSynth and SynthText
Paper            Multi-Granularity Prediction for Scene Text Recognition
Tensor Type      F32

What is mgp-str-base?

MGP-STR base is a scene text recognition model that combines a Vision Transformer backbone with specially designed A^3 (Adaptive Addressing and Aggregation) modules. Developed by Alibaba DAMO Academy, it predicts text at multiple granularities (characters and subwords) and fuses those predictions to improve accuracy.

Implementation Details

The model takes 32x128 images and splits them into 4x4-pixel patches. Its ViT backbone is initialized from DeiT-base weights and modified for text recognition. The A^3 modules select and combine meaningful tokens from the ViT output to predict characters and subwords.
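The patch layout follows from simple arithmetic. The sketch below computes the patch grid for the stated input and patch sizes; `patch_grid` is a hypothetical helper written for illustration, not part of the model's code.

```python
def patch_grid(height, width, patch):
    """Return (rows, cols, total) for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    rows, cols = height // patch, width // patch
    return rows, cols, rows * cols


# A 32x128 input with 4x4 patches gives an 8x32 grid of patch tokens.
rows, cols, total = patch_grid(32, 128, 4)
print(rows, cols, total)  # 8 32 256
```

The small patch size keeps fine stroke detail at the cost of a longer token sequence than a typical 16x16-patch ViT would produce on the same image.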

  • Implements multi-granularity prediction combining character, subword, and word-level recognition
  • Uses BPE and WordPiece A^3 modules for subword predictions
  • Incorporates absolute position embeddings for spatial awareness
  • Features an effective fusion strategy for prediction integration
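The fusion step can be sketched as follows. This is a minimal illustration that assumes each head (character, BPE, WordPiece) yields a decoded string plus per-token maximum probabilities, that a head's confidence is the product of those probabilities, and that the most confident head wins. The function names and all probability values are made up for the example.

```python
from math import prod


def confidence(token_probs):
    """Score a prediction as the product of its per-token max probabilities."""
    return prod(token_probs)


def fuse(candidates):
    """Pick the most confident head.

    candidates maps a head name to (decoded_text, per_token_probs).
    Returns (head_name, decoded_text, score).
    """
    scored = {
        name: (text, confidence(probs))
        for name, (text, probs) in candidates.items()
    }
    best = max(scored, key=lambda name: scored[name][1])
    text, score = scored[best]
    return best, text, score


# Illustrative values only: the character head is unsure about one letter,
# while the BPE head reads the word with two confident subword tokens.
candidates = {
    "char": ("ticket", [0.99, 0.98, 0.97, 0.60, 0.95, 0.99, 0.99]),
    "bpe": ("ticket", [0.97, 0.99]),
    "wordpiece": ("tickel", [0.90, 0.70, 0.95]),
}
head, text, score = fuse(candidates)
print(head, text)  # bpe ticket
```

Selecting a single head by confidence, rather than averaging, lets a subword head override a character head that is uncertain about one glyph.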

Core Capabilities

  • High-accuracy scene text recognition
  • Robust handling of various text styles and orientations
  • Efficient processing of 32x128 pixel images
  • Advanced language modeling through subword classification

Frequently Asked Questions

Q: What makes this model unique?

The model's unique strength lies in its multi-granularity prediction approach, combining character, subword, and word-level recognition through specialized A^3 modules. This allows for more robust and accurate text recognition across various scenarios.

Q: What are the recommended use cases?

The model is suited to optical character recognition (OCR) tasks, particularly scene text recognition. Because it reads a single 32x128 text image at a time, applications such as document processing, automated data extraction, and real-world text recognition systems typically pair it with a separate step that locates and crops text regions.
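For a quick trial, the sketch below runs the model on one cropped text image through the Hugging Face transformers library. It assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint is available on the Hugging Face Hub as `alibaba-damo/mgp-str-base`. `recognize_text` and the image path are hypothetical names chosen for illustration.

```python
def recognize_text(image_path, model_id="alibaba-damo/mgp-str-base"):
    """Return the recognized string for one cropped text image."""
    # Heavy dependencies are imported lazily so this module loads without them.
    from PIL import Image
    from transformers import MgpstrForSceneTextRecognition, MgpstrProcessor

    processor = MgpstrProcessor.from_pretrained(model_id)
    model = MgpstrForSceneTextRecognition.from_pretrained(model_id)

    # The processor resizes and normalizes the image to the model's input size.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # outputs.logits holds one tensor per head (character, BPE, WordPiece);
    # batch_decode fuses them into a single string per image.
    outputs = model(pixel_values)
    return processor.batch_decode(outputs.logits)["generated_text"][0]
```

Calling `recognize_text("word.png")` downloads the weights on first use, so the first call needs network access.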
