NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

Published

Sep 4, 2024

Updated

Sep 4, 2024

Unlocking AI’s Potential: Navigating Nested API Calls

https://arxiv.org/abs/2409.03797v1

Summary

Imagine asking an AI assistant to plan a trip. It needs to not only search for flights and hotels but also use the results of each search to refine the next, like choosing a hotel near the chosen airport. This intricate dance of interconnected actions, where the output of one task becomes the input for another, is what we call nested API calls. Researchers are exploring this complexity with a new benchmark called NESTFUL. Current AI models, while impressive in handling single API calls, often stumble when faced with nested sequences. Think of it as a chain reaction – a small error in one step can snowball into a larger problem down the line.

NESTFUL challenges Large Language Models (LLMs) with this realistic scenario using real-world APIs from areas like travel, finance, and social media, along with more abstract, non-executable API sequences. The benchmark tests the model's ability to select the right APIs, fill in the correct parameters, and execute them in the proper order, even when the required actions aren’t explicitly stated. Early results from NESTFUL reveal that even the most powerful LLMs struggle with this nested complexity. They often fumble with data types, misinterpret API specifications, and fail to chain the operations correctly. This highlights the need for better training data and refined algorithms that can manage these dependencies.

However, there’s reason for optimism. NESTFUL provides a vital training ground for the next generation of AI models, allowing researchers to pinpoint weaknesses and design more sophisticated systems. As LLMs learn to navigate the intricacies of nested APIs, they’ll be better equipped to tackle the complex tasks we expect of truly intelligent assistants.

Questions & Answers

What technical challenges do LLMs face when handling nested API calls according to the NESTFUL benchmark?

LLMs face three primary technical challenges with nested API calls: data type handling, API specification interpretation, and operation chaining. The models struggle to maintain consistency across multiple interconnected API calls, where the output of one operation must correctly feed into another. For example, when planning a trip, an LLM might correctly fetch flight information but fail to properly extract and format the arrival airport data as input for a subsequent hotel search API call. This cascading dependency requires precise handling of data types, proper parameter mapping, and understanding of API specifications - areas where current LLMs show significant limitations. Real-world applications might fail when an LLM incorrectly formats a date string or misinterprets numerical parameters across multiple API calls.
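
The flight-then-hotel dependency described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `search_flights`, `search_hotels`, the `"$label.field"` reference syntax, and every field name are hypothetical and are not taken from the NESTFUL paper or from any real travel API. The sketch shows how a later call's arguments are filled from an earlier call's output, and how a reference to a call that has not run yet fails.

```python
from typing import Any, Callable

# Hypothetical stand-ins for real travel APIs; names and fields are assumptions.
def search_flights(origin: str, destination: str) -> dict[str, Any]:
    return {"flight_no": "XY123", "arrival_airport": destination,
            "arrival_date": "2024-09-04"}

def search_hotels(near_airport: str, check_in: str) -> dict[str, Any]:
    return {"hotel": f"Hotel near {near_airport}", "check_in": check_in}

TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "search_flights": search_flights,
    "search_hotels": search_hotels,
}

def resolve(value: Any, results: dict[str, dict[str, Any]]) -> Any:
    """Replace a "$label.field" reference with a field of an earlier output."""
    if isinstance(value, str) and value.startswith("$"):
        label, _, field = value[1:].partition(".")
        if label not in results:
            raise KeyError(f"call {label!r} has not run yet")
        return results[label][field]  # KeyError if the field does not exist
    return value

def run_sequence(calls: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Execute calls in order, feeding earlier outputs into later arguments."""
    results: dict[str, dict[str, Any]] = {}
    for call in calls:
        args = {k: resolve(v, results) for k, v in call["arguments"].items()}
        results[call["label"]] = TOOLS[call["name"]](**args)
    return results

plan = [
    {"label": "flight", "name": "search_flights",
     "arguments": {"origin": "JFK", "destination": "LIS"}},
    {"label": "hotel", "name": "search_hotels",
     "arguments": {"near_airport": "$flight.arrival_airport",
                   "check_in": "$flight.arrival_date"}},
]
outputs = run_sequence(plan)
```

If a model emits the hotel call before the flight call, or references a field the flight output does not contain, `resolve` raises a `KeyError`. That is the kind of chaining mistake the answer above describes.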

What are the benefits of AI assistants that can handle complex, multi-step tasks?

AI assistants capable of handling complex, multi-step tasks offer significant advantages in automation and efficiency. They can seamlessly coordinate multiple operations, like booking travel arrangements that require checking flights, hotels, and local transportation in a logical sequence. The key benefit is reduced human intervention - instead of manually coordinating multiple services, users can simply state their end goal. These AI assistants can be valuable in various industries, from travel planning to financial services, where multiple API calls need to be coordinated. For example, a business could use such an AI to automate customer service processes that involve multiple systems or databases.

How will improvements in nested API handling change the future of AI assistants?

Advancements in nested API handling will transform AI assistants into more capable and autonomous problem-solvers. As these systems become better at managing complex, interconnected tasks, they'll be able to handle more sophisticated requests with less human oversight. This evolution will lead to AI assistants that can truly function as personal coordinators, managing everything from complex travel arrangements to multi-step business processes. For instance, future AI assistants might independently handle an entire event planning process, coordinating venues, catering, invitations, and schedules through multiple service providers, all while maintaining logical dependencies between each step.

PromptLayer Features

Workflow Management
NESTFUL's nested API call sequences align with PromptLayer's multi-step orchestration capabilities for managing complex, interdependent prompt chains

Implementation Details

Create reusable templates for common API call patterns, implement version tracking for each step, establish dependency management between steps, and monitor data flow between calls.
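
One way to picture dependency management between steps is a topological sort over a step graph. The sketch below uses Python's standard-library `graphlib` and does not use or represent PromptLayer's API. The step names and the `execution_order` helper are hypothetical, chosen only to mirror the travel example from the summary.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the steps whose output it consumes.
dependencies: dict[str, set[str]] = {
    "search_flights": set(),
    "search_hotels": {"search_flights"},
    "book_transfer": {"search_flights", "search_hotels"},
}

def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return an order in which every step runs after its dependencies.

    Raises ValueError for references to undefined steps. Cycles surface as
    graphlib.CycleError, which is itself a subclass of ValueError.
    """
    unknown = {d for needed in deps.values() for d in needed} - set(deps)
    if unknown:
        raise ValueError(f"undefined steps: {sorted(unknown)}")
    return list(TopologicalSorter(deps).static_order())

order = execution_order(dependencies)
```

Validating the graph before any call runs catches a missing or circular dependency up front, rather than partway through a chain of API requests.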

Key Benefits

• Reproducible execution of complex API call sequences
• Transparent tracking of data flow between steps
• Easier debugging of nested dependencies

Potential Improvements

• Add visual workflow builder for API sequences
• Implement automated dependency validation
• Create pre-built templates for common API patterns

Business Value

Efficiency Gains

50% reduction in time spent managing complex API workflows

Cost Savings

30% decrease in API usage costs through optimized call sequences

Quality Improvement

90% reduction in errors from mismanaged API dependencies

Testing & Evaluation
NESTFUL's benchmark methodology maps to PromptLayer's testing capabilities for evaluating complex prompt chains and API interactions

Implementation Details

Set up batch tests for API sequences, implement regression testing for chain accuracy, and create scoring metrics for successful API call completion.
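
As a rough illustration of scoring metrics for API sequences, the sketch below computes an F1 score over predicted API names and an exact full-sequence match rate across a batch. These are assumed example metrics, not necessarily the ones NESTFUL or PromptLayer report, and the function names are hypothetical.

```python
from collections import Counter

def api_name_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 over API names, treating each sequence as a multiset (order ignored)."""
    if not predicted or not gold:
        return float(predicted == gold)  # 1.0 only when both are empty
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def full_sequence_match(predicted: list[str], gold: list[str]) -> bool:
    """True only when the same APIs are called in the same order."""
    return predicted == gold

def score_batch(cases: list[tuple[list[str], list[str]]]) -> dict[str, float]:
    """Average both metrics over (predicted, gold) pairs."""
    if not cases:
        raise ValueError("no test cases to score")
    n = len(cases)
    return {
        "mean_f1": sum(api_name_f1(p, g) for p, g in cases) / n,
        "full_match_rate": sum(full_sequence_match(p, g) for p, g in cases) / n,
    }

gold = ["search_flights", "search_hotels", "book_transfer"]
partial = ["search_flights", "search_hotels"]  # the model dropped the last call
report = score_batch([(gold, gold), (partial, gold)])
```

Tracking both numbers across prompt versions is one way to run a regression check: a drop in `full_match_rate` while `mean_f1` holds steady would suggest the right APIs are still chosen but the chain is ordered or completed incorrectly.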

Key Benefits

• Comprehensive testing of nested API scenarios
• Early detection of chain-breaking changes
• Quantitative performance tracking

Potential Improvements

• Add specialized metrics for API chain testing
• Implement automated chain validation
• Create visual chain analysis tools

Business Value

Efficiency Gains

40% faster validation of API chain changes

Cost Savings

25% reduction in debugging time and resources

Quality Improvement

85% increase in successful API chain executions
