OmniParser-v2.0

Maintained By
microsoft

Property      Value
Developer     Microsoft
License       AGPL (icon_detect), MIT (icon_caption)
Model Type    Screen Parsing Tool
Performance   0.6s/frame on A100, 0.8s/frame on 4090

What is OmniParser-v2.0?

OmniParser-v2.0 is a screen parsing tool that converts UI screenshots into structured data. It combines a finetuned YOLOv8 model with a Florence-2 base model to detect interactive elements and describe their function. Compared with the previous version, it reports a 60% latency improvement and higher accuracy on the ScreenSpot Pro benchmark.

Implementation Details

The architecture has two primary components: an interactable icon detection model trained on popular web pages, and an icon description module that maps UI elements to their functions. Its output is designed to be consumed by LLMs from several providers, including OpenAI, DeepSeek, Qwen, and Anthropic Computer Use.

  • Improved dataset with cleaner icon caption and grounding annotations
  • Enhanced processing speed: 0.6s/frame on A100 GPU
  • 39.6 average accuracy on ScreenSpot Pro benchmark
  • Unified OmniTool interface for Windows 11 VM control
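The two-component design described above can be pictured as a detect-then-describe pipeline. The following is a minimal sketch under stated assumptions, not OmniParser's actual API: `parse_screen`, `ParsedElement`, `fake_detect`, and `fake_caption` are hypothetical names, and the two stub functions stand in for the YOLOv8 detector and the Florence-2 captioner.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (x1, y1, x2, y2), assumed here to be normalized to the 0..1 range
Box = Tuple[float, float, float, float]


@dataclass
class ParsedElement:
    box: Box
    description: str


def parse_screen(
    image: object,
    detect: Callable[[object], List[Box]],
    caption: Callable[[object, Box], str],
) -> List[ParsedElement]:
    """Stage 1 finds interactable regions; stage 2 describes each one."""
    return [
        ParsedElement(box=box, description=caption(image, box))
        for box in detect(image)
    ]


# Hypothetical stand-ins for the real models, so the sketch runs without weights.
def fake_detect(image: object) -> List[Box]:
    return [(0.10, 0.05, 0.20, 0.10), (0.80, 0.90, 0.95, 0.97)]


def fake_caption(image: object, box: Box) -> str:
    # Describe a region by where its top edge sits on the screen.
    return "top-left control" if box[1] < 0.5 else "bottom-right control"


elements = parse_screen(image=None, detect=fake_detect, caption=fake_caption)
for element in elements:
    print(element.box, element.description)
```

Passing the detector and captioner in as callables keeps the two stages independent, so either model could be swapped or mocked in tests.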

Core Capabilities

  • Accurate detection of clickable and actionable regions in UI
  • Semantic interpretation of UI elements
  • Cross-platform support (PC and Phone applications)
  • Structured output format for LLM integration
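To illustrate what a structured output for LLM integration might look like, here is a small sketch. The field names (`id`, `type`, `bbox`, `interactable`, `content`) and the helpers `to_prompt` and `click_point` are assumptions made for this example and are not taken from the OmniParser documentation; the bounding boxes are assumed to be normalized to the 0..1 range.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical parsed output for one screenshot.
elements: List[Dict] = [
    {"id": 0, "type": "icon", "bbox": [0.10, 0.20, 0.20, 0.30],
     "interactable": True, "content": "Settings"},
    {"id": 1, "type": "text", "bbox": [0.40, 0.50, 0.60, 0.54],
     "interactable": False, "content": "Welcome back"},
]


def to_prompt(elements: List[Dict]) -> str:
    """List only the interactable elements, one per line, for an LLM prompt."""
    lines = [
        f'[{e["id"]}] {e["type"]}: {e["content"]}'
        for e in elements
        if e["interactable"]
    ]
    return "Interactable elements:\n" + "\n".join(lines)


def click_point(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int]:
    """Convert a normalized box into the pixel coordinates of its center."""
    x1, y1, x2, y2 = bbox
    return (round((x1 + x2) / 2 * width), round((y1 + y2) / 2 * height))


prompt = to_prompt(elements)
target = click_point(elements[0]["bbox"], width=1920, height=1080)
print(prompt)
print(target)
```

An agent could send `prompt` to the LLM, ask it to answer with an element id, and then use `click_point` to turn the chosen element back into screen coordinates.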

Frequently Asked Questions

Q: What makes this model unique?

OmniParser-v2.0 combines UI element detection with semantic descriptions of those elements while keeping latency low (0.6s/frame on an A100). The unified OmniTool interface gives developers a single entry point for building UI automation solutions.

Q: What are the recommended use cases?

The model is ideal for UI automation, screen reading applications, and building LLM-based UI agents. It's particularly suited for scenarios requiring structured interpretation of user interfaces, though human oversight is recommended for critical applications.
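One way to add human oversight to an agent built on parsed screen output is a confirmation gate in front of risky actions. The sketch below is an illustration only: `CRITICAL_KEYWORDS`, `needs_review`, and `execute` are hypothetical, and a keyword match is a deliberately simple stand-in for whatever risk policy a real application would use.

```python
from typing import Callable, List

# Hypothetical keyword list; a real policy would be application-specific.
CRITICAL_KEYWORDS = ("delete", "pay", "send", "submit")


def needs_review(description: str) -> bool:
    """Flag an element whose description suggests an irreversible action."""
    text = description.lower()
    return any(word in text for word in CRITICAL_KEYWORDS)


def execute(
    description: str,
    act: Callable[[], None],
    approve: Callable[[str], bool],
) -> bool:
    """Run `act` unless the element is flagged and the reviewer declines."""
    if needs_review(description) and not approve(description):
        return False
    act()
    return True


log: List[str] = []
deny_all = lambda description: False  # reviewer who rejects every request

ran_safe = execute("Open settings menu", lambda: log.append("settings"), deny_all)
ran_risky = execute("Delete account button", lambda: log.append("delete"), deny_all)
print(ran_safe, ran_risky, log)
```

In an interactive tool, `approve` could prompt a person on the console or in a review queue instead of returning a fixed answer.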
