UI-TARS-72B-DPO

Maintained by: bytedance-research

Property      Value
Model Size    72B parameters
Paper         arXiv:2501.12326
Repository    https://github.com/bytedance/UI-TARS
Author        ByteDance Research

What is UI-TARS-72B-DPO?

UI-TARS-72B-DPO is a native GUI agent model for automated interface interaction. It integrates perception, reasoning, grounding, and memory in a single vision-language model, which lets it automate tasks end to end without predefined workflows or manual rules.
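
To make the end-to-end idea concrete, here is a minimal sketch of an agent loop in which a single model call covers perception, reasoning, grounding, and memory at each step. Everything in it is an assumption made for illustration: the names `Step`, `run_agent`, and `scripted_model`, the action strings such as "click(120, 340)" and "finished()", and the callback signatures are hypothetical and are not the UI-TARS API. The scripted stand-in replaces the real model so the loop can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    thought: str  # the model's reasoning for this step
    action: str   # a grounded action string (hypothetical format)


# Hypothetical model signature: (instruction, screenshot, history) -> next Step.
Model = Callable[[str, bytes, List[Step]], Step]


def run_agent(
    instruction: str,
    model: Model,
    capture: Callable[[], bytes],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[Step]:
    """Observe, decide, act, repeat; no task-specific workflow is coded here."""
    history: List[Step] = []  # memory: every earlier thought and action
    for _ in range(max_steps):
        screenshot = capture()                         # perception input
        step = model(instruction, screenshot, history)  # reasoning and grounding
        history.append(step)
        if step.action == "finished()":                # model signals completion
            break
        execute(step.action)                           # apply the action to the UI
    return history


def scripted_model(instruction: str, screenshot: bytes, history: List[Step]) -> Step:
    """Stand-in for the real model: clicks once, then reports completion."""
    if not history:
        return Step("The Save button is visible.", "click(120, 340)")
    return Step("The document has been saved.", "finished()")


executed: List[str] = []
history = run_agent(
    "Save the document",
    scripted_model,
    capture=lambda: b"",       # a real agent would return screenshot bytes
    execute=executed.append,   # a real agent would drive the mouse or keyboard
)
print([step.action for step in history])
```

In a modular framework, the `model` call would be split across separate detectors, planners, and rule tables; in this sketch the loop only passes the screenshot and history through and executes whatever action comes back.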

Implementation Details

The model is designed to understand and interact with graphical user interfaces directly. Reported benchmark scores include VisualWebBench (82.8%), WebSRC (89.3%), and SQAshort (88.6%).

  • Integrated perception and reasoning capabilities
  • End-to-end task automation
  • Superior performance in GUI interaction tasks
  • Advanced grounding capabilities across different interface types

Core Capabilities

  • Mobile interface interaction (94.8% accuracy)
  • Desktop environment handling (91.2% accuracy)
  • Web interface manipulation (91.5% accuracy)
  • Cross-domain task execution with high success rates (62.1% SR)
  • Robust element recognition and operation execution
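
The percentages above are aggregate scores. As a small worked example of how such a figure is computed, the sketch below assumes that "SR" stands for success rate, meaning the share of attempts judged successful; the source does not define the term, and the helper `success_rate` is hypothetical.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of attempts marked successful."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative outcomes only, not benchmark data: 3 successes out of 4 attempts.
print(success_rate([True, False, True, True]))
```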

Frequently Asked Questions

Q: What makes this model unique?

UI-TARS-72B-DPO takes a unified approach to GUI interaction: perception, reasoning, grounding, and memory are combined in a single model rather than split across the components of a traditional modular framework. It reports state-of-the-art performance on several benchmarks and can handle complex GUI interactions without predefined rules.

Q: What are the recommended use cases?

The model is suited to automated GUI testing, user interface interaction automation, cross-platform task execution, and general-purpose interface manipulation across mobile, desktop, and web platforms. It is particularly effective for complex workflows that require both visual understanding and logical reasoning.
