UI-TARS-7B-SFT
| Property | Value |
|---|---|
| Model Size | 7B parameters |
| Type | GUI Interaction Model |
| Paper | arXiv:2501.12326 |
| Author | ByteDance Research |
What is UI-TARS-7B-SFT?
UI-TARS-7B-SFT is a native GUI agent model that integrates perception, reasoning, grounding, and memory into a single vision-language model. It is designed to interact with graphical user interfaces in a human-like way, without predefined workflows or manually written rules.
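For orientation, here is a minimal sketch of querying the model with a screenshot and a natural-language instruction. It assumes the checkpoint is served through Hugging Face `transformers` as a Qwen2-VL-style vision-language model; the repository id, prompt wording, and file name are illustrative assumptions, not official documentation.

```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "bytedance-research/UI-TARS-7B-SFT"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

screenshot = Image.open("screenshot.png")  # current GUI state (hypothetical file)
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Open the Settings menu."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(response)  # expected: a short reasoning trace followed by a grounded action
```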
Implementation Details
The model takes an end-to-end approach to GUI interaction, combining capabilities that are traditionally handled by separate modules in a single network. It reports strong results across benchmarks, including ScreenSpot (89.5% average accuracy) and Mind2Web tasks. Its key features are listed below; a sketch of how a controller might consume the model's output follows the list.
- Integrated perception and reasoning capabilities
- End-to-end GUI interaction without predefined rules
- Superior performance in both text and icon/widget recognition
- Robust cross-domain functionality
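Interaction is text-mediated: the model emits a short reasoning trace followed by a single action call, which a controller parses and executes. The parser below is a hypothetical sketch; the `Thought:`/`Action:` layout and the `click(start_box='(x, y)')` syntax follow the style reported for UI-TARS, but the exact grammar is an assumption here.

```python
import re

# Matches an action call such as: Action: click(start_box='(412, 37)')
ACTION_RE = re.compile(r"Action:\s*(?P<name>\w+)\((?P<args>.*)\)", re.S)

def parse_action(output: str) -> dict:
    """Extract the action name and raw argument string from a model response."""
    match = ACTION_RE.search(output)
    if match is None:
        raise ValueError(f"no action found in: {output!r}")
    return {"name": match.group("name"), "args": match.group("args").strip()}

sample = "Thought: The Settings icon is in the top bar.\nAction: click(start_box='(412, 37)')"
print(parse_action(sample))  # {'name': 'click', 'args': "start_box='(412, 37)'"}
```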
Core Capabilities
- Visual Understanding: 93.6% accuracy on WebSRC benchmark
- Element Grounding: 47.8% text accuracy and 16.2% icon accuracy
- Multi-platform Support: Excellent performance across mobile, desktop, and web interfaces
- Task Automation: 67.1% success rate in cross-task scenarios
Frequently Asked Questions
Q: What makes this model unique?
UI-TARS-7B-SFT stands out for its unified approach to GUI interaction: a single model covers capabilities that are usually split across separate modules. It reports state-of-the-art results on several GUI benchmarks and can handle complex interactions across different platforms.
Q: What are the recommended use cases?
The model is ideal for automated GUI testing, user interface interaction automation, accessibility tools, and general GUI-based task automation across mobile, desktop, and web platforms. It's particularly effective in scenarios requiring both visual understanding and interactive decision-making.
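As an illustration of that last point, here is a hedged sketch of a closed perception-action loop for desktop automation. `predict_action()` stands in for the model call and parser sketched above; `pyautogui` and the `click`/`finished` action names are assumptions made for this example, not part of the model's release.

```python
import pyautogui  # assumed executor; any screenshot/input library would do

def predict_action(screenshot, instruction: str) -> dict:
    """Hypothetical wrapper: run UI-TARS on (screenshot, instruction) and
    parse its response into {'name': ..., 'coords': ...} as sketched above."""
    raise NotImplementedError

def run_task(instruction: str, max_steps: int = 10) -> None:
    for _ in range(max_steps):
        screenshot = pyautogui.screenshot()               # perceive GUI state
        action = predict_action(screenshot, instruction)  # reason and ground
        if action["name"] == "finished":                  # model signals success
            return
        if action["name"] == "click":
            x, y = action["coords"]
            pyautogui.click(x, y)                         # act on the host GUI
```

The loop makes the division of labor explicit: the model handles perception, reasoning, and grounding in one forward pass, while the surrounding harness only captures screenshots and replays the chosen actions.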