UI-TARS-7B-SFT
| Property | Value |
|---|---|
| Model Size | 7B parameters |
| Type | GUI Interaction Model |
| Paper | arXiv:2501.12326 |
| Author | ByteDance Research |
What is UI-TARS-7B-SFT?
UI-TARS-7B-SFT is a native GUI agent model that integrates perception, reasoning, grounding, and memory into a single vision-language model. It is designed to interact with graphical user interfaces in a human-like way, without predefined workflows or manually written rules.
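For orientation, here is a minimal sketch of querying the model with a screenshot and a natural-language instruction. It assumes the checkpoint is served through Hugging Face `transformers` as a Qwen2-VL-style vision-language model; the repository id, prompt wording, and file name are illustrative assumptions, not official documentation.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "bytedance-research/UI-TARS-7B-SFT"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

screenshot = Image.open("screenshot.png")  # current GUI state (hypothetical file)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Open the Settings menu."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)  # expected: a short reasoning trace followed by a grounded action
```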
Implementation Details
The model takes an end-to-end approach to GUI interaction, combining capabilities that are traditionally handled by separate modules in a single network. It reports strong results across benchmarks, including ScreenSpot (89.5% average accuracy) and Mind2Web tasks. Its key features are listed below; a sketch of how a controller might consume the model's output follows the list.
- Integrated perception and reasoning capabilities
- End-to-end GUI interaction without predefined rules
- Superior performance in both text and icon/widget recognition
- Robust cross-domain functionality
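Interaction is text-mediated: the model emits a short reasoning trace followed by a single action call, which a controller parses and executes. The parser below is a hypothetical sketch; the `Thought:`/`Action:` layout and the `click(start_box='(x, y)')` syntax follow the style reported for UI-TARS, but the exact grammar is an assumption here.

```python
import re

# Matches an action call such as: Action: click(start_box='(412, 37)')
ACTION_RE = re.compile(r"Action:\s*(?P<name>\w+)\((?P<args>.*)\)", re.S)

def parse_action(output: str) -> dict:
    """Extract the action name and raw argument string from a model response."""
    match = ACTION_RE.search(output)
    if match is None:
        raise ValueError(f"no action found in: {output!r}")
    return {"name": match.group("name"), "args": match.group("args").strip()}

sample = "Thought: The Settings icon is in the top bar.\nAction: click(start_box='(412, 37)')"
print(parse_action(sample))  # {'name': 'click', 'args': "start_box='(412, 37)'"}
```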
Core Capabilities
- Visual Understanding: 93.6% accuracy on WebSRC benchmark
- Element Grounding: 47.8% text accuracy and 16.2% icon accuracy
- Multi-platform Support: Excellent performance across mobile, desktop, and web interfaces
- Task Automation: 67.1% success rate in cross-task scenarios
Frequently Asked Questions
Q: What makes this model unique?
UI-TARS-7B-SFT stands out for its unified approach to GUI interaction: a single model covers capabilities that are usually split across separate modules. It reports state-of-the-art results on several GUI benchmarks and can handle complex interactions across different platforms.
Q: What are the recommended use cases?
The model is ideal for automated GUI testing, user interface interaction automation, accessibility tools, and general GUI-based task automation across mobile, desktop, and web platforms. It's particularly effective in scenarios requiring both visual understanding and interactive decision-making.
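As an illustration of that last point, here is a hedged sketch of a closed perception-action loop for desktop automation. `predict_action()` stands in for the model call and parser sketched above; `pyautogui` and the `click`/`finished` action names are assumptions made for this example, not part of the model's release.

```python
import pyautogui  # assumed executor; any screenshot/input library would do

def predict_action(screenshot, instruction: str) -> dict:
    """Hypothetical wrapper: run UI-TARS on (screenshot, instruction) and
    parse its response into {'name': ..., 'coords': ...} as sketched above."""
    raise NotImplementedError

def run_task(instruction: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()               # perceive GUI state
        action = predict_action(screenshot, instruction)  # reason and ground
        if action["name"] == "finished":                  # model signals success
            return
        if action["name"] == "click":
            x, y = action["coords"]
            pyautogui.click(x, y)                         # act on the host GUI
```

The loop makes the division of labor explicit: the model handles perception, reasoning, and grounding in one forward pass, while the surrounding harness only captures screenshots and replays the chosen actions.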