UI-TARS-7B-SFT

Maintained by: bytedance-research

Property      Value
Model Size    7B parameters
Type          GUI Interaction Model
Paper         arXiv:2501.12326
Author        ByteDance Research

What is UI-TARS-7B-SFT?

UI-TARS-7B-SFT is a native GUI agent model that integrates perception, reasoning, grounding, and memory into a single vision-language model. It is designed to operate graphical user interfaces the way a person would, without predefined workflows or hand-written rules.

Implementation Details

The model takes an end-to-end approach, combining capabilities that are traditionally handled by separate modules. Reported benchmark results include 89.5% average accuracy on ScreenSpot, along with results on Mind2Web tasks.

  • Integrated perception and reasoning capabilities
  • End-to-end GUI interaction without predefined rules
  • Superior performance in both text and icon/widget recognition
  • Robust cross-domain functionality
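The end-to-end idea above can be pictured as a simple observe, predict, act loop. The sketch below is illustrative only: `Action`, `Step`, `run_episode`, and the stub functions are hypothetical names, and `fake_predict` stands in for the model, whose real inference interface is not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Action:
    kind: str                          # hypothetical action names: "click", "type", "finished"
    point: Tuple[int, int] = (0, 0)    # screen coordinates for pointer actions
    text: str = ""                     # text payload for typing actions


@dataclass
class Step:
    thought: str                       # the reasoning produced for this step
    action: Action                     # the grounded action chosen for this step


@dataclass
class AgentState:
    instruction: str
    history: List[Step] = field(default_factory=list)  # memory of earlier steps


def run_episode(
    instruction: str,
    capture_screen: Callable[[], bytes],
    predict: Callable[[str, bytes, List[Step]], Step],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> AgentState:
    """Observe the screen, ask the model for one step, act, and repeat."""
    state = AgentState(instruction)
    for _ in range(max_steps):
        screenshot = capture_screen()                        # perception input
        step = predict(instruction, screenshot, state.history)
        state.history.append(step)                           # remember what was done
        if step.action.kind == "finished":
            break
        execute(step.action)                                 # act on the interface
    return state


# Stubs that stand in for a real screen grabber and a real model (assumptions).
def fake_capture() -> bytes:
    return b"fake-screenshot"


def fake_predict(instruction: str, screenshot: bytes, history: List[Step]) -> Step:
    if len(history) == 0:
        return Step("The search box is near the top.", Action("click", (400, 60)))
    if len(history) == 1:
        return Step("Now enter the query.", Action("type", text="weather"))
    return Step("The task is complete.", Action("finished"))


executed: List[Action] = []
final_state = run_episode("Search for weather", fake_capture, fake_predict, executed.append)
print([step.action.kind for step in final_state.history])
```

The point of the sketch is that a single `predict` call returns both the reasoning and the grounded action, which is what distinguishes this design from a pipeline of separate perception, planning, and grounding modules.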

Core Capabilities

  • Visual Understanding: 93.6% accuracy on WebSRC benchmark
  • Element Grounding: 47.8% text accuracy and 16.2% icon accuracy
  • Multi-platform Support: Excellent performance across mobile, desktop, and web interfaces
  • Task Automation: 67.1% success rate in cross-task scenarios
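For readers unfamiliar with how separate text and icon grounding figures can arise, here is a minimal sketch of one common scoring rule: a prediction counts as correct when the predicted click point falls inside the target element's bounding box, and accuracy is reported per element type. This rule and the sample data are assumptions for illustration; they are not taken from the benchmarks' own evaluation code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside or on the edge of the box."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(samples: Iterable[Tuple[str, Point, Box]]) -> Dict[str, float]:
    """Compute the hit rate per element type (for example "text" or "icon")."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for element_type, predicted, target in samples:
        totals[element_type] += 1
        if point_in_box(predicted, target):
            hits[element_type] += 1
    return {element_type: hits[element_type] / totals[element_type] for element_type in totals}


# Made-up samples: (element type, predicted click point, target bounding box).
samples = [
    ("text", (50, 20), (40, 10, 120, 30)),    # inside the box
    ("text", (300, 20), (40, 10, 120, 30)),   # outside the box
    ("icon", (15, 15), (10, 10, 26, 26)),     # inside the box
    ("icon", (15, 40), (10, 10, 26, 26)),     # outside the box
    ("icon", (5, 15), (10, 10, 26, 26)),      # outside the box
    ("icon", (30, 30), (10, 10, 26, 26)),     # outside the box
]

scores = grounding_accuracy(samples)
print(scores)
```

Icons are typically small targets with no readable label, so a point-in-box rule tends to penalise them more than text elements, which is consistent with the gap between the two figures listed above.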

Frequently Asked Questions

Q: What makes this model unique?

UI-TARS-7B-SFT takes a unified approach to GUI interaction: a single model combines the capabilities that other systems split across separate modules. It reports state-of-the-art results on several benchmarks and can handle complex GUI interactions across different platforms.

Q: What are the recommended use cases?

The model is suited to automated GUI testing, user interface interaction automation, accessibility tools, and general GUI-based task automation across mobile, desktop, and web platforms. It is particularly effective in scenarios that require both visual understanding and interactive decision-making.
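In a testing or automation harness, text emitted by a GUI agent usually has to be turned into a structured command before it can be executed. The sketch below parses a made-up action syntax such as `click(x=120, y=340)`. The syntax and the `parse_action` helper are hypothetical and are not the model's documented output format.

```python
import re
from typing import Dict, Tuple, Union

# Hypothetical action syntax: name(key=value, ...), where a value is either
# an integer or a double-quoted string without embedded quotes.
_ACTION_RE = re.compile(r"^\s*(\w+)\((.*)\)\s*$")
_ARG_RE = re.compile(r'(\w+)\s*=\s*(?:"([^"]*)"|(-?\d+))')


def parse_action(raw: str) -> Tuple[str, Dict[str, Union[int, str]]]:
    """Split an action string into its name and a dict of typed arguments."""
    match = _ACTION_RE.match(raw)
    if match is None:
        raise ValueError(f"unrecognised action: {raw!r}")
    name, arg_text = match.groups()
    args: Dict[str, Union[int, str]] = {}
    for key, quoted, number in _ARG_RE.findall(arg_text):
        # Exactly one of the two value groups is filled for a valid argument.
        args[key] = int(number) if number else quoted
    return name, args


print(parse_action("click(x=120, y=340)"))
print(parse_action('type(text="hello world")'))
```

Keeping parsing separate from execution lets a test suite assert on the parsed command without touching a real interface.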
