ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Published

Jun 28, 2024

Updated

Jul 22, 2024

Can AI Really Use Your Apps? A New Benchmark Reveals the Truth

https://arxiv.org/abs/2407.00132v2

Summary

Imagine asking your AI assistant to book a flight, schedule a meeting, or even create a playlist, all using the apps on your phone. While this sounds like the future, how close are we really? Researchers have developed ShortcutsBench, a benchmark that tests an AI's ability to use real-world APIs, the building blocks of your favorite apps. Instead of focusing on theoretical tasks, ShortcutsBench uses actual APIs from Apple's operating systems and real user requests taken from the Shortcuts app.

The results? While AI has made impressive strides, it still struggles with complex, multi-step actions. Current AI agents, even those powered by cutting-edge large language models (LLMs) like Gemini and GPT, excel at simple tasks like "Check the weather and tell me." However, they stumble when faced with scenarios requiring multiple app interactions or intricate parameter settings.

The research shows that the biggest hurdle for AI isn't selecting the right app, but rather figuring out *how* to use it. Specifically, extracting the right information from your request and plugging it into the correct parameters within the API call proves surprisingly difficult.

Another significant challenge is the AI’s awareness of missing information. Often, a user request doesn't provide every detail an app needs to function. For instance, asking AI to "book a table" requires it to realize it needs to know *where*, *when*, and for *how many people*. The current generation of AI often overlooks these implicit requirements.

ShortcutsBench’s design, using real-world APIs and genuine user queries, gives us a far clearer picture of AI's true capabilities. While there’s still work to be done, this research paves the way for truly helpful, app-savvy AI assistants in the not-so-distant future.
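The "book a table" case can be made concrete with a minimal sketch of a missing-information check. Everything here is hypothetical and for illustration only: the `ApiSpec` class, the `book_table` API, and its parameter names are assumptions, not part of ShortcutsBench or any Apple API.

```python
from dataclasses import dataclass


@dataclass
class ApiSpec:
    """Hypothetical description of an API: its name and required parameters."""
    name: str
    required: list


def missing_parameters(spec, extracted):
    """Return required parameters that were not extracted from the request."""
    return [p for p in spec.required if extracted.get(p) in (None, "")]


# Hypothetical spec for the "book a table" example from the summary.
book_table = ApiSpec("book_table", ["location", "time", "party_size"])

# Suppose only a time could be extracted from the user's request.
extracted = {"time": "19:00"}

missing = missing_parameters(book_table, extracted)
if missing:
    # An agent aware of the gap would ask the user instead of guessing.
    print("Need to ask the user for:", ", ".join(missing))
```

An agent that runs a check like this before calling the API can ask a follow-up question rather than filling the gap with an invented value.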

Questions & Answers

How does ShortcutsBench evaluate an AI's ability to interact with APIs?

ShortcutsBench evaluates AI performance using real-world APIs from Apple's operating systems and actual user requests from the Shortcuts app. The benchmark specifically tests two key capabilities: 1) The AI's ability to select appropriate APIs for given tasks, and 2) Its proficiency in correctly parameterizing API calls with user-provided information. For example, when a user requests 'book a table,' the system must identify the restaurant booking API and recognize the need for essential parameters like location, time, and party size. This real-world testing approach provides a more accurate assessment of AI's practical capabilities compared to theoretical benchmarks.
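The two capabilities described above can be scored separately. The following is a rough sketch under stated assumptions, not the paper's actual metric: the function names, the position-by-position comparison, and the example API names (`get_weather`, `send_message`, `send_email`) are all hypothetical.

```python
def api_selection_accuracy(predicted, expected):
    """Fraction of expected API calls matched at the same position."""
    if not expected:
        return 0.0
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def parameter_accuracy(predicted_params, expected_params):
    """Fraction of expected parameters filled with the expected value."""
    if not expected_params:
        return 1.0
    hits = sum(predicted_params.get(k) == v for k, v in expected_params.items())
    return hits / len(expected_params)


# Hypothetical two-step task: check the weather, then message someone.
expected_calls = ["get_weather", "send_message"]
predicted_calls = ["get_weather", "send_email"]
selection_score = api_selection_accuracy(predicted_calls, expected_calls)

# Hypothetical parameters for the first call; the agent omitted the unit.
expected_params = {"city": "Paris", "unit": "celsius"}
predicted_params = {"city": "Paris"}
param_score = parameter_accuracy(predicted_params, expected_params)

print(f"API selection: {selection_score:.2f}, parameters: {param_score:.2f}")
```

Keeping the two scores apart makes it possible to see whether an agent fails at choosing the API or at filling in its parameters, which is the distinction the summary highlights.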

What are the main benefits of AI-powered app automation in daily life?

AI-powered app automation can significantly streamline everyday tasks by allowing users to control multiple applications through simple voice or text commands. The primary benefits include time savings, reduced cognitive load, and improved task efficiency. For instance, instead of manually opening weather apps, calendar apps, and messaging apps separately, users could potentially ask their AI assistant to check the weather, schedule meetings, and send notifications in one go. While current AI still has limitations with complex tasks, even basic automation can help users manage their digital lives more effectively.

How will AI assistants transform the way we interact with mobile apps?

AI assistants are poised to revolutionize mobile app interaction by creating a more intuitive and unified user experience. Instead of navigating multiple apps separately, users will be able to accomplish tasks through natural language commands, with AI handling the technical details behind the scenes. This transformation will particularly benefit less tech-savvy users and those with accessibility needs. While current AI still struggles with complex multi-step actions, ongoing research and development suggest that more sophisticated app control through AI will become increasingly common in the near future.

PromptLayer Features

Testing & Evaluation
ShortcutsBench's evaluation of AI's ability to handle API calls directly relates to systematic prompt testing needs

Implementation Details

Create test suites that evaluate prompt performance across different API interaction scenarios, similar to ShortcutsBench's methodology

Key Benefits

• Systematic evaluation of prompt effectiveness for API interactions
• Identification of parameter extraction accuracy
• Measurement of multi-step task completion success rates

Potential Improvements

• Add real-world API interaction test cases
• Implement parameter validation checks
• Create complexity-based test categorization
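Complexity-based categorization could be sketched as follows. The bucket names and the thresholds (1 call, up to 5 calls, more than 5) are arbitrary assumptions chosen for illustration; they are not taken from ShortcutsBench or from PromptLayer.

```python
from collections import defaultdict


def complexity_bucket(num_calls):
    """Assign a test case to a bucket by its number of API calls (assumed thresholds)."""
    if num_calls <= 1:
        return "simple"
    if num_calls <= 5:
        return "moderate"
    return "complex"


def success_rate_by_bucket(results):
    """results: iterable of (num_api_calls, passed) pairs, one per test case."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [passed, total]
    for num_calls, passed in results:
        bucket = complexity_bucket(num_calls)
        totals[bucket][0] += int(passed)
        totals[bucket][1] += 1
    return {bucket: passed / total for bucket, (passed, total) in totals.items()}


# Made-up test outcomes, purely to exercise the functions.
results = [(1, True), (1, True), (3, True), (4, False), (8, False)]
rates = success_rate_by_bucket(results)
print(rates)
```

Reporting success per bucket, rather than as one overall number, makes a drop in performance on multi-step tasks visible.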

Business Value

Efficiency Gains

Reduce development time by 40% through automated testing of API interaction capabilities

Cost Savings

Lower error rates in production by catching API interaction issues early

Quality Improvement

More reliable AI assistance through systematic validation

Workflow Management
The paper's focus on multi-step actions and parameter handling aligns with workflow orchestration needs

Implementation Details

Design workflow templates that handle complex API interactions and parameter validation

Key Benefits

• Structured handling of multi-step API interactions
• Standardized parameter validation processes
• Reusable workflow components

Potential Improvements

• Add dynamic parameter validation
• Implement context-aware workflow selection
• Create automated parameter completion
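One way to picture a multi-step workflow with parameter validation is a runner that checks each step's required inputs before executing it. This is a minimal sketch under assumptions: `run_workflow`, the step tuple layout, and the `get_weather` / `send_message` steps are hypothetical and do not represent a PromptLayer or Shortcuts API.

```python
def run_workflow(steps, context):
    """Run steps in order; stop and report if a step lacks a required input.

    Each step is a (name, required_keys, action) tuple, where action takes the
    current context dict and returns a dict of new values to merge into it.
    """
    context = dict(context)  # copy so the caller's dict is not mutated
    for name, required, action in steps:
        missing = [key for key in required if key not in context]
        if missing:
            return {"status": "needs_input", "step": name,
                    "missing": missing, "context": context}
        context.update(action(context))
    return {"status": "done", "context": context}


# Hypothetical two-step workflow: fetch a forecast, then send it to someone.
steps = [
    ("get_weather", ["city"],
     lambda c: {"forecast": f"sunny in {c['city']}"}),
    ("send_message", ["recipient", "forecast"],
     lambda c: {"sent": True}),
]

first = run_workflow(steps, {"city": "Paris"})
second = run_workflow(steps, {"city": "Paris", "recipient": "Alex"})
print(first["status"], first.get("missing"))
print(second["status"])
```

Returning a `needs_input` result instead of raising an error lets the caller ask the user for the missing value and rerun the workflow with a fuller context.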

Business Value

Efficiency Gains

Reduce complex task implementation time by 50% through reusable workflows

Cost Savings

Minimize API usage costs through optimized interaction patterns

Quality Improvement

Higher success rates in complex multi-step operations
