OmniParser-v2.0
| Property | Value |
|---|---|
| Developer | Microsoft |
| License | AGPL (icon_detect), MIT (icon_caption) |
| Model Type | Screen Parsing Tool |
| Performance | 0.6 s/frame on A100, 0.8 s/frame on RTX 4090 |
What is OmniParser-v2.0?
OmniParser-v2.0 is a screen parsing tool that transforms UI screenshots into structured data. It combines a finetuned YOLOv8 model with a Florence-2 base model to detect interactive elements and interpret their functionality. This version brings significant improvements over v1, including roughly 60% lower latency and higher accuracy on the ScreenSpot Pro grounding benchmark.
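As a rough illustration, the two-stage pipeline can be sketched with the ultralytics and transformers libraries. This is a minimal sketch, not the official inference code: the local weight paths and the `<CAPTION>` prompt are assumptions based on the published model layout, and the real package wraps this logic with additional pre- and post-processing.

```python
# A hedged sketch of the detect-then-caption pipeline. Assumes the icon_detect
# and icon_caption weights have been downloaded locally from the OmniParser-v2.0
# repository on Hugging Face; the paths below are illustrative.
from PIL import Image
from ultralytics import YOLO
from transformers import AutoProcessor, AutoModelForCausalLM

detector = YOLO("weights/icon_detect/model.pt")  # finetuned YOLOv8 (AGPL)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-base", trust_remote_code=True)
captioner = AutoModelForCausalLM.from_pretrained(
    "weights/icon_caption", trust_remote_code=True)  # Florence-2-based (MIT)

screenshot = Image.open("screenshot.png")

# Stage 1: detect interactable regions as bounding boxes.
boxes = detector(screenshot)[0].boxes.xyxy.tolist()

# Stage 2: caption each cropped region to describe its function.
elements = []
for x1, y1, x2, y2 in boxes:
    crop = screenshot.crop((int(x1), int(y1), int(x2), int(y2)))
    inputs = processor(text="<CAPTION>", images=crop, return_tensors="pt")
    ids = captioner.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=32,
    )
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0]
    elements.append({"bbox": [x1, y1, x2, y2], "caption": caption})
```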
Implementation Details
The model architecture integrates two primary components: an interactable icon detection module trained on popular web pages, and an icon description module that maps UI elements to their functions. It is built to work with a range of LLM backends, including OpenAI, DeepSeek, Qwen, and Anthropic Computer Use. Key improvements in this release include:
- Improved dataset with cleaner icon caption and grounding annotations
- Enhanced processing speed: 0.6 s/frame on an A100 GPU
- 39.6% average accuracy on the ScreenSpot Pro benchmark
- Unified OmniTool interface for Windows 11 VM control
Core Capabilities
- Accurate detection of clickable and actionable regions in the UI
- Semantic interpretation of UI elements
- Cross-platform support (PC and phone applications)
- Structured output format for LLM integration (see the sketch below)
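As an example of what that structured output can look like once handed to an LLM, here is one possible serialization. The schema (`bbox` plus `caption`, matching the pipeline sketch above) is an illustrative assumption, not OmniParser's exact output format:

```python
def elements_to_prompt(elements: list[dict]) -> str:
    """Render parsed UI elements as a numbered list an LLM can reference
    by index. The element schema is an assumption carried over from the
    pipeline sketch above, not the official output format."""
    lines = []
    for i, el in enumerate(elements):
        x1, y1, x2, y2 = el["bbox"]
        lines.append(f"[{i}] {el['caption']} @ ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return "\n".join(lines)
```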
Frequently Asked Questions
Q: What makes this model unique?
OmniParser-v2.0 stands out for combining accurate UI element detection with semantic understanding while maintaining high throughput and low latency. The unified OmniTool interface makes it exceptionally user-friendly for developing UI automation solutions.
Q: What are the recommended use cases?
The model is ideal for UI automation, screen reading applications, and building LLM-based UI agents. It's particularly suited for scenarios requiring structured interpretation of user interfaces, though human oversight is recommended for critical applications.
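To make the agent use case concrete, the loop below sketches one way an LLM-based UI agent could be structured around OmniParser's output, reusing the `elements_to_prompt` helper from the earlier sketch. Every other helper here (`capture_screen`, `parse_screenshot`, `query_llm`, `click_center`) is a hypothetical placeholder, not part of OmniParser or any specific LLM SDK:

```python
def run_agent(task: str, max_steps: int = 10) -> None:
    """Drive the UI toward `task`: parse the screen, ask an LLM which
    element to act on, act, repeat. All helpers are hypothetical stubs
    standing in for a screen grabber, the OmniParser pipeline, an LLM
    client, and an input controller."""
    for _ in range(max_steps):
        screenshot = capture_screen()            # hypothetical: grab the current screen
        elements = parse_screenshot(screenshot)  # hypothetical: OmniParser two-stage pipeline
        prompt = (
            f"Task: {task}\n"
            f"UI elements:\n{elements_to_prompt(elements)}\n"
            "Reply with the index of the element to click, or DONE."
        )
        reply = query_llm(prompt).strip()        # hypothetical: any chat-completion client
        if reply == "DONE":
            return
        target = elements[int(reply)]            # trusts the reply; real agents should validate it
        click_center(target["bbox"])             # hypothetical: click at the bbox center
```

A production agent would add validation of the model's reply, retries, and the kind of human oversight recommended above.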