ShowUI-2B

Maintained by: showlab

  • Parameter Count: 2.21B
  • Model Type: Vision-Language-Action
  • Base Model: Qwen2-VL-2B-Instruct
  • Paper: arXiv:2411.17465
  • Tensor Type: BF16

What is ShowUI-2B?

ShowUI-2B is a lightweight vision-language-action model designed specifically for GUI agents. Built on the Qwen2-VL architecture, it interprets screen contents and generates the actions needed to operate computer interfaces. In practice, the model takes a screenshot plus a textual instruction and produces a precise, coordinate-based action such as a click or text input.

Implementation Details

The model is implemented in PyTorch and uses Safetensors for parameter storage. It processes a visual input (a screenshot) together with a textual instruction and generates the corresponding interface action, and it can be loaded with the standard Qwen2-VL classes from Hugging Face Transformers (see the loading sketch after the list below).

  • Built on Qwen2-VL-2B-Instruct architecture
  • Supports multiple action types including CLICK, INPUT, SELECT, HOVER, and more
  • Processes images with a configurable pixel budget (256x28x28 to 1344x28x28)
  • Implements coordinate-based interface interaction
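
Since the base model is Qwen2-VL-2B-Instruct, the weights can in principle be loaded with the standard Qwen2-VL classes from Hugging Face Transformers. The sketch below is illustrative only: the repository id `showlab/ShowUI-2B` is inferred from the maintainer and model name above, and the pixel-budget values mirror the range listed in this section.

```python
# Minimal loading sketch, assuming the Hugging Face repo id "showlab/ShowUI-2B"
# (inferred from the maintainer/model name; verify against the official card).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "showlab/ShowUI-2B"  # assumed repository id

# BF16 matches the tensor type listed in the property table above.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# min/max pixel budgets mirroring the 256x28x28 to 1344x28x28 range noted above.
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    min_pixels=256 * 28 * 28,
    max_pixels=1344 * 28 * 28,
)
```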

Core Capabilities

  • UI Grounding: Precise element location identification
  • Multi-modal Navigation: Combines vision and language for interface navigation
  • Action Generation: Produces contextually appropriate interface actions
  • Cross-platform Support: Works with both web and mobile interfaces
  • Coordinate System: Uses relative coordinates (0-1 scale) for precise positioning
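
Because the model outputs relative coordinates on a 0-1 scale, automation code typically rescales them to the screenshot's actual resolution before dispatching an action. A tiny illustrative helper (the function name is hypothetical, not part of any ShowUI API):

```python
# Illustrative helper: map the model's relative (0-1) coordinates to
# absolute pixel positions for a screenshot of a given size.
def to_absolute(rel_x: float, rel_y: float, width: int, height: int) -> tuple[int, int]:
    """Convert relative coordinates in [0, 1] to pixel coordinates."""
    return round(rel_x * width), round(rel_y * height)

# Example: a predicted click at (0.62, 0.31) on a 1920x1080 screenshot.
x, y = to_absolute(0.62, 0.31, 1920, 1080)  # -> (1190, 335)
```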

Frequently Asked Questions

Q: What makes this model unique?

ShowUI-2B stands out for its specialized focus on GUI interaction, combining vision-language understanding with precise action generation in a lightweight 2B parameter package. Its ability to process screen contents and generate coordinate-based actions makes it particularly suitable for automated interface interaction.

Q: What are the recommended use cases?

The model is ideal for automated GUI testing, interface navigation assistance, and developing AI-powered interface agents. It's particularly useful for web automation, mobile app testing, and creating assistive technology for interface interaction.
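
For the UI-grounding use case, a query pairs a screenshot with a description of the target element, and the model replies with that element's relative coordinates. The sketch below reuses the `model` and `processor` from the loading example and follows the generic Qwen2-VL chat-template flow; the prompt wording, the parsed output format, and the `qwen_vl_utils` helper are assumptions rather than the documented ShowUI protocol, so check the official model card before relying on them.

```python
# Hypothetical UI-grounding query; prompt wording and output format are assumptions.
import ast
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "screenshot.png"},
        {"type": "text",
         "text": "Locate the 'Submit' button and return its relative [x, y] coordinates."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
output_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

# If the reply is a literal like "[0.62, 0.31]", parse and rescale it with
# a helper such as to_absolute() from the Core Capabilities section.
rel_x, rel_y = ast.literal_eval(output_text)
```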

🍰 Interested in building your own agents?
PromptLayer provides Huggingface integration tools to manage and monitor prompts with your whole team. Get started here.