OmniParser-v2.0
| Property | Value |
|---|---|
| Developer | Microsoft |
| License | AGPL (icon_detect), MIT (icon_caption) |
| Model Type | Screen Parsing Tool |
| Performance | 0.6 s/frame on A100, 0.8 s/frame on RTX 4090 |
What is OmniParser-v2.0?
OmniParser-v2.0 is a screen parsing tool that transforms UI screenshots into structured data. It combines a finetuned YOLOv8 model with a Florence-2 base model to detect interactive elements and interpret their functionality. This version brings significant improvements over v1, including roughly 60% lower latency and higher accuracy on the ScreenSpot Pro grounding benchmark.
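As a rough illustration, the two-stage pipeline can be sketched with the ultralytics and transformers libraries. This is a minimal sketch, not the official inference code: the local weight paths and the `<CAPTION>` prompt are assumptions based on the published model layout, and the real package wraps this logic with additional pre- and post-processing.

```python
# A hedged sketch of the detect-then-caption pipeline. Assumes the icon_detect
# and icon_caption weights have been downloaded locally from the OmniParser-v2.0
# repository on Hugging Face; the paths below are illustrative.
from PIL import Image
from ultralytics import YOLO
from transformers import AutoProcessor, AutoModelForCausalLM

detector = YOLO("weights/icon_detect/model.pt")  # finetuned YOLOv8 (AGPL)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True)
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption", trust_remote_code=True)  # Florence-2-based (MIT)

screenshot = Image.open("screenshot.png")

# Stage 1: detect interactable regions as bounding boxes.
boxes = detector(screenshot)[0].boxes.xyxy.tolist()

# Stage 2: caption each cropped region to describe its function.
elements = []
for x1, y1, x2, y2 in boxes:
    crop = screenshot.crop((int(x1), int(y1), int(x2), int(y2)))
    inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
    ids = captioner.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=32,
    )
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0]
    elements.append({"bbox": [x1, y1, x2, y2], "caption": caption})
```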
Implementation Details
The model architecture integrates two primary components: an interactable icon detection module trained on popular web pages, and an icon description module that maps UI elements to their functions. It is built to work with a range of LLM backends, including OpenAI, DeepSeek, Qwen, and Anthropic Computer Use. Key improvements in this release include:
- Improved dataset with cleaner icon caption and grounding annotations
- Enhanced processing speed: 0.6 s/frame on an A100 GPU
- 39.6% average accuracy on the ScreenSpot Pro benchmark
- Unified OmniTool interface for Windows 11 VM control
Core Capabilities
- Accurate detection of clickable and actionable regions in the UI
- Semantic interpretation of UI elements
- Cross-platform support (PC and phone applications)
- Structured output format for LLM integration (see the sketch below)
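As an example of what that structured output can look like once handed to an LLM, here is one possible serialization. The schema (`bbox` plus `caption`, matching the pipeline sketch above) is an illustrative assumption, not OmniParser's exact output format:

```python
def elements_to_prompt(elements: list[dict]) -> str:
    """Render parsed UI elements as a numbered list an LLM can reference
    by index. The element schema is an assumption carried over from the
    pipeline sketch above, not the official output format."""
    lines = []
    for i, el in enumerate(elements):
        x1, y1, x2, y2 = el["bbox"]
        lines.append(f"[{i}] {el['caption']} @ ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return "\n".join(lines)
```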
Frequently Asked Questions
Q: What makes this model unique?
OmniParser-v2.0 stands out for combining accurate UI element detection with semantic understanding while maintaining high throughput and low latency. The unified OmniTool interface makes it exceptionally user-friendly for developing UI automation solutions.
Q: What are the recommended use cases?
The model is ideal for UI automation, screen reading applications, and building LLM-based UI agents. It's particularly suited for scenarios requiring structured interpretation of user interfaces, though human oversight is recommended for critical applications.
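To make the agent use case concrete, the loop below sketches one way an LLM-based UI agent could be structured around OmniParser's output, reusing the `elements_to_prompt` helper from the earlier sketch. Every other helper here (`capture_screen`, `parse_screenshot`, `query_llm`, `click_center`) is a hypothetical placeholder, not part of OmniParser or any specific LLM SDK:

```python
def run_agent(task: str, max_steps: int = 10) -> None:
    """Drive the UI toward `task`: parse the screen, ask an LLM which
    element to act on, act, repeat. All helpers are hypothetical stubs
    standing in for a screen grabber, the OmniParser pipeline, an LLM
    client, and an input controller."""
    for _ in range(max_steps):
        screenshot = capture_screen()            # hypothetical: grab the current screen
        elements = parse_screenshot(screenshot)  # hypothetical: OmniParser two-stage pipeline
        prompt = (
            f"Task: {task}\n"
            f"UI elements:\n{elements_to_prompt(elements)}\n"
            "Reply with the index of the element to click, or DONE."
        )
        reply = query_llm(prompt).strip()        # hypothetical: any chat-completion client
        if reply == "DONE":
            return
        target = elements[int(reply)]            # trusts the reply; real agents should validate it
        click_center(target["bbox"])             # hypothetical: click at the bbox center
```

A production agent would add validation of the model's reply, retries, and the kind of human oversight recommended above.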