UI-TARS-72B-DPO

bytedance-research

UI-TARS-72B-DPO is a GUI interaction model with strong perception and reasoning capabilities, reported to achieve 90.3% accuracy on GUI tasks.

Property: Value
Model Size: 72B parameters
Paper: arXiv:2501.12326
Repository: https://github.com/bytedance/UI-TARS
Author: ByteDance Research

What is UI-TARS-72B-DPO?

UI-TARS-72B-DPO is a native GUI agent model for automated interface interaction. It integrates perception, reasoning, grounding, and memory within a single vision-language model, enabling end-to-end task automation without predefined workflows or manual rules.
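To make the "single model, no predefined workflow" idea concrete, here is a minimal sketch of an agent loop in Python. It is not the UI-TARS API: the names Action, run_agent, capture_screen, predict_action, and execute are all hypothetical placeholders, and the model call is passed in as a plain function so the loop can be exercised with a stub.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """Hypothetical action record; not the model's real output format."""
    kind: str            # e.g. "click", "type", or "finished"
    argument: str = ""   # e.g. a target description or text to type


def run_agent(
    instruction: str,
    capture_screen: Callable[[], bytes],
    predict_action: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Run a perceive -> reason -> act loop until the model says it is done.

    predict_action stands in for the vision-language model (an assumption):
    it receives the instruction, the current screenshot, and the action
    history (the "memory"), and returns the next action.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()                              # perception input
        action = predict_action(instruction, screenshot, history)  # reasoning + grounding
        history.append(action)                                     # memory
        if action.kind == "finished":
            break
        execute(action)                                            # act on the GUI
    return history
```

The point of the sketch is that no task-specific rules appear in the loop itself: every decision is delegated to one model call, and the only state carried between steps is the action history.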

Implementation Details

The model is designed to understand and interact with graphical user interfaces directly. It reports scores of 82.8% on VisualWebBench, 89.3% on WebSRC, and 88.6% on SQAshort (ScreenQA-short).

  • Integrated perception and reasoning capabilities
  • End-to-end task automation
  • Superior performance in GUI interaction tasks
  • Advanced grounding capabilities across different interface types
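Grounding, as listed above, means turning a description of an element into a location on screen. The following sketch shows one common post-processing step, assuming the model emits coordinates normalized to a fixed range (0 to 1000 here). That range, and the function name to_pixels, are assumptions for illustration and are not stated in this page.

```python
from typing import Tuple


def to_pixels(
    rel_x: float,
    rel_y: float,
    width: int,
    height: int,
    scale: int = 1000,  # assumed normalization range, not confirmed by the source
) -> Tuple[int, int]:
    """Map normalized model coordinates onto a screenshot of the given size."""
    if not (0 <= rel_x <= scale and 0 <= rel_y <= scale):
        raise ValueError("coordinates must lie within [0, scale]")
    # Scale each axis independently so the mapping works at any resolution.
    return round(rel_x / scale * width), round(rel_y / scale * height)


# A point at the horizontal center, one quarter down, on a 1920x1080 screen.
print(to_pixels(500, 250, 1920, 1080))  # (960, 270)
```

Keeping coordinates resolution-independent is one way the same prediction could be reused across mobile, desktop, and web screenshots of different sizes.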

Core Capabilities

  • Exceptional performance in mobile interface interaction (94.8% accuracy)
  • Strong desktop environment handling (91.2% accuracy)
  • Web interface manipulation (91.5% accuracy)
  • Cross-domain task execution with high success rates (62.1% SR)
  • Robust element recognition and operation execution

Frequently Asked Questions

Q: What makes this model unique?

UI-TARS-72B-DPO stands out for its unified approach to GUI interaction, combining all essential components in a single model rather than using traditional modular frameworks. It achieves state-of-the-art performance across various benchmarks and can handle complex GUI interactions without predefined rules.

Q: What are the recommended use cases?

The model is suited to automated GUI testing, user interface interaction automation, cross-platform task execution, and general-purpose interface manipulation across mobile, desktop, and web platforms. It is particularly effective for complex workflows that require both visual understanding and logical reasoning.
