# PTA-1: Prompt-to-Automation Model
| Property | Value |
|---|---|
| Parameter Count | 271M |
| License | MIT |
| Base Model | Microsoft Florence-2-base |
| Language | English |
| Tensor Type | F32 |
## What is PTA-1?
PTA-1 is a vision-language model designed specifically for computer and phone automation tasks. Built on Microsoft's Florence-2 architecture, it achieves strong performance in GUI text and element localization despite its compact size of 271M parameters. Given a screenshot and a textual description, the model precisely locates the referenced interface element, making it well suited to automation tasks.
## Implementation Details
The model operates on a simple premise: it takes a screenshot and a description of a target element as input and outputs the corresponding bounding box coordinates. It requires only common dependencies (PyTorch, transformers, and Pillow) and supports both CPU and GPU execution, with automatic tensor type optimization. A minimal usage sketch follows the feature list below.
- Efficient architecture with only 271M parameters
- Outperforms larger models in GUI element detection
- Supports F32 tensor operations
- Local execution capability for reduced latency
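Below is a minimal usage sketch. The repo id `AskUI/PTA-1` and the `<OPEN_VOCABULARY_DETECTION>` task token are assumptions based on the standard Florence-2 interface, not something this card confirms; check the published model page for the exact prompt format.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Dependencies: torch, transformers, pillow (Florence-2 remote code
# may additionally require einops and timm).
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed repo id; verify against the actual model page.
model_id = "AskUI/PTA-1"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Input: a screenshot plus a natural-language description of the target element.
image = Image.open("screenshot.png").convert("RGB")
task = "<OPEN_VOCABULARY_DETECTION>"  # assumed Florence-2-style task token
prompt = task + "the search button in the top-right corner"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Florence-2 processors expose a helper that parses the raw generation
# into structured boxes and labels for the given task.
result = processor.post_process_generation(
    raw, task=task, image_size=(image.width, image.height)
)
print(result)  # e.g. {"<OPEN_VOCABULARY_DETECTION>": {"bboxes": [[x1, y1, x2, y2]], ...}}
```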
## Core Capabilities
- Precise element localization with 79.98% mean accuracy
- Exceptional performance on Wave-UI dataset (90.69%)
- Strong text-based element detection (76.28% on PTA-text)
- Efficient processing of screenshots and natural language queries
## Frequently Asked Questions
Q: What makes this model unique?
PTA-1 achieves superior localization accuracy with significantly fewer parameters than competing models, outperforming models up to 30 times larger. This makes it well suited to local deployment and real-time applications.
Q: What are the recommended use cases?
The model is particularly well-suited for GUI automation tasks, including test automation, accessibility tools, and interactive computer vision applications where precise element localization is crucial.
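In those automation use cases, the predicted bounding box usually has to become an input event. A minimal sketch of that last step, assuming the Florence-2-style parsed output shown earlier and using pyautogui (an assumption; this card does not name an input library) to issue the click:

```python
import pyautogui  # hypothetical choice; any input-synthesis library works


def click_detected_element(result: dict, task: str = "<OPEN_VOCABULARY_DETECTION>") -> None:
    """Click the center of the first detected bounding box.

    Assumes `result` follows Florence-2's parsed detection format, e.g.
    {"<OPEN_VOCABULARY_DETECTION>": {"bboxes": [[x1, y1, x2, y2], ...], ...}}.
    """
    x1, y1, x2, y2 = result[task]["bboxes"][0]
    # Box coordinates are in screenshot pixels; map them to screen pixels
    # first if the display uses DPI scaling.
    pyautogui.click((x1 + x2) / 2, (y1 + y2) / 2)
```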