Published: Jun 28, 2024
Updated: Jul 22, 2024

Can AI Really Use Your Apps? A New Benchmark Reveals the Truth

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents
By Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

Summary

Imagine asking your AI assistant to book a flight, schedule a meeting, or even create a playlist, all using the apps on your phone. While this sounds like the future, how close are we really? Researchers have developed ShortcutsBench, a benchmark that tests an AI's ability to use real-world APIs, the building blocks of your favorite apps. Instead of focusing on theoretical tasks, ShortcutsBench uses actual APIs from Apple's operating systems and real user requests taken from the Shortcuts app.

The results? While AI has made impressive strides, it still struggles with complex, multi-step actions. Current AI agents, even those powered by cutting-edge large language models (LLMs) like Gemini and GPT, excel at simple tasks like "Check the weather and tell me." However, they stumble when faced with scenarios requiring multiple app interactions or intricate parameter settings.

The research shows that the biggest hurdle for AI isn't selecting the right app, but rather figuring out *how* to use it. Specifically, extracting the right information from your request and plugging it into the correct parameters within the API call proves surprisingly difficult.

Another significant challenge is the AI’s awareness of missing information. Often, a user request doesn't provide every detail an app needs to function. For instance, asking AI to "book a table" requires it to realize it needs to know *where*, *when*, and for *how many people*. The current generation of AI often overlooks these implicit requirements.

ShortcutsBench’s design, using real-world APIs and genuine user queries, gives us a far clearer picture of AI's true capabilities. While there’s still work to be done, this research paves the way for truly helpful, app-savvy AI assistants in the not-so-distant future.
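To make the "missing information" problem concrete, here is a minimal Python sketch of the check an agent would need to perform before calling an API. The `ApiSpec` class, the `book_table` API, and its parameter names are hypothetical illustrations, not part of ShortcutsBench or any Apple API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApiSpec:
    """Hypothetical description of one API and the parameters it requires."""
    name: str
    required_params: tuple


def missing_params(spec, extracted):
    """Return the required parameters the request did not supply, in spec order."""
    return [p for p in spec.required_params if extracted.get(p) in (None, "")]


# Hypothetical spec for the "book a table" example from the summary.
book_table = ApiSpec("book_table", ("location", "time", "party_size"))

# Suppose the user only said "book a table for four".
extracted = {"party_size": 4}

print(missing_params(book_table, extracted))  # ['location', 'time']
```

An agent that is aware of missing information would ask the user for `location` and `time` here instead of guessing values for them.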

Questions & Answers

How does ShortcutsBench evaluate an AI's ability to interact with APIs?
ShortcutsBench evaluates AI performance using real-world APIs from Apple's operating systems and actual user requests from the Shortcuts app. The benchmark tests two key capabilities: 1) the AI's ability to select the appropriate API for a given task, and 2) its proficiency in filling the API call's parameters correctly from the information the user provides. The research also highlights a related skill: noticing when required information is missing. For example, when a user asks to "book a table," the system must identify the restaurant booking API and recognize that it needs essential parameters such as location, time, and party size. Testing against real APIs and real requests gives a more accurate picture of an AI's practical capabilities than benchmarks built on theoretical tasks.
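The two capabilities above can be scored separately. The following Python sketch is one simple way to do so and is not the paper's actual metric; the call format (`{"api": ..., "params": {...}}`) and the `weather.get_forecast` name are assumptions made for illustration.

```python
def score_call(predicted, expected):
    """Score one predicted API call against a reference call.

    Both calls are dicts shaped like {"api": str, "params": {name: value}}.
    Parameters only count when the API itself was selected correctly.
    """
    api_correct = predicted.get("api") == expected["api"]
    expected_params = expected.get("params", {})
    predicted_params = predicted.get("params", {}) if api_correct else {}

    matched = sum(
        1
        for name, value in expected_params.items()
        if name in predicted_params and predicted_params[name] == value
    )
    if expected_params:
        param_accuracy = matched / len(expected_params)
    else:
        # No parameters to fill: credit depends only on API selection.
        param_accuracy = 1.0 if api_correct else 0.0

    return {"api_correct": api_correct, "param_accuracy": param_accuracy}


# Hypothetical reference call and a prediction with one wrong parameter.
expected = {"api": "weather.get_forecast", "params": {"city": "Paris", "unit": "celsius"}}
predicted = {"api": "weather.get_forecast", "params": {"city": "Paris", "unit": "fahrenheit"}}

print(score_call(predicted, expected))  # {'api_correct': True, 'param_accuracy': 0.5}
```

Keeping the two scores apart mirrors the finding in the summary: a model can pick the right API and still fail on the parameters.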
What are the main benefits of AI-powered app automation in daily life?
AI-powered app automation can significantly streamline everyday tasks by allowing users to control multiple applications through simple voice or text commands. The primary benefits include time savings, reduced cognitive load, and improved task efficiency. For instance, instead of manually opening weather apps, calendar apps, and messaging apps separately, users could potentially ask their AI assistant to check the weather, schedule meetings, and send notifications in one go. While current AI still has limitations with complex tasks, even basic automation can help users manage their digital lives more effectively.
How will AI assistants transform the way we interact with mobile apps?
AI assistants are poised to revolutionize mobile app interaction by creating a more intuitive and unified user experience. Instead of navigating multiple apps separately, users will be able to accomplish tasks through natural language commands, with AI handling the technical details behind the scenes. This transformation will particularly benefit less tech-savvy users and those with accessibility needs. While current AI still struggles with complex multi-step actions, ongoing research and development suggest that more sophisticated app control through AI will become increasingly common in the near future.

PromptLayer Features

  1. Testing & Evaluation
ShortcutsBench's evaluation of an AI's ability to handle API calls relates directly to the need for systematic prompt testing.
Implementation Details
Create test suites that evaluate prompt performance across different API interaction scenarios, similar to ShortcutsBench's methodology
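As a rough illustration of such a test suite, the Python sketch below runs an agent over a few cases and reports the pass rate per complexity level. It is a generic harness, not PromptLayer's API; the test cases, the API names, and `stub_agent` are all hypothetical, and a real suite would replace the stub with an actual model call.

```python
from collections import defaultdict

# Hypothetical test cases: a user query, the expected calls, and a complexity label.
TEST_CASES = [
    {
        "query": "Check the weather in Paris",
        "complexity": "single-step",
        "expected": [{"api": "get_weather", "params": {"city": "Paris"}}],
    },
    {
        "query": "Check the weather in Paris and text it to Sam",
        "complexity": "multi-step",
        "expected": [
            {"api": "get_weather", "params": {"city": "Paris"}},
            {"api": "send_message", "params": {"to": "Sam"}},
        ],
    },
]


def run_suite(agent, cases):
    """Return the fraction of fully correct cases for each complexity label."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for case in cases:
        total[case["complexity"]] += 1
        if agent(case["query"]) == case["expected"]:
            passed[case["complexity"]] += 1
    return {label: passed[label] / total[label] for label in total}


def stub_agent(query):
    """Stand-in for a real model call; it only ever emits a weather lookup."""
    return [{"api": "get_weather", "params": {"city": "Paris"}}]


print(run_suite(stub_agent, TEST_CASES))  # {'single-step': 1.0, 'multi-step': 0.0}
```

Grouping results by complexity makes the gap between simple and multi-step tasks visible at a glance.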
Key Benefits
• Systematic evaluation of prompt effectiveness for API interactions
• Identification of parameter extraction accuracy
• Measurement of multi-step task completion success rates
Potential Improvements
• Add real-world API interaction test cases
• Implement parameter validation checks
• Create complexity-based test categorization
Business Value
Efficiency Gains
Reduce development time by 40% through automated testing of API interaction capabilities
Cost Savings
Lower error rates in production by catching API interaction issues early
Quality Improvement
More reliable AI assistance through systematic validation
  2. Workflow Management
The paper's focus on multi-step actions and parameter handling aligns with workflow orchestration needs.
Implementation Details
Design workflow templates that handle complex API interactions and parameter validation
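One possible shape for such a template is sketched below in Python: each step declares the inputs it needs and the name of its output, and the runner validates inputs before every call. This is an assumption-laden illustration rather than a PromptLayer feature; `run_workflow`, the step format, and the two stub functions are hypothetical.

```python
def get_weather(city):
    """Stub for a weather API call (hypothetical)."""
    return f"Sunny in {city}"


def compose_message(forecast, recipient):
    """Stub for a messaging API call (hypothetical)."""
    return f"To {recipient}: {forecast}"


# A reusable template: each step names its inputs and where its result is stored.
WORKFLOW = [
    {"name": "get_weather", "run": get_weather,
     "inputs": ("city",), "output": "forecast"},
    {"name": "compose_message", "run": compose_message,
     "inputs": ("forecast", "recipient"), "output": "message"},
]


def run_workflow(steps, context):
    """Run steps in order, validating each step's inputs before calling it."""
    ctx = dict(context)  # copy so the caller's dict is not modified
    for step in steps:
        missing = [key for key in step["inputs"] if key not in ctx]
        if missing:
            raise ValueError(f"{step['name']}: missing inputs {missing}")
        kwargs = {key: ctx[key] for key in step["inputs"]}
        ctx[step["output"]] = step["run"](**kwargs)
    return ctx


result = run_workflow(WORKFLOW, {"city": "Paris", "recipient": "Sam"})
print(result["message"])  # To Sam: Sunny in Paris
```

Failing fast on a missing input, rather than passing a guessed value downstream, keeps errors attached to the step that caused them.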
Key Benefits
• Structured handling of multi-step API interactions
• Standardized parameter validation processes
• Reusable workflow components
Potential Improvements
• Add dynamic parameter validation
• Implement context-aware workflow selection
• Create automated parameter completion
Business Value
Efficiency Gains
Reduce complex task implementation time by 50% through reusable workflows
Cost Savings
Minimize API usage costs through optimized interaction patterns
Quality Improvement
Higher success rates in complex multi-step operations
