Published: Sep 23, 2024
Updated: Sep 23, 2024

Can LLMs Really Use APIs? A New Benchmark Suite Reveals the Truth

SEAL: Suite for Evaluating API-use of LLMs
By Woojeong Kim, Ashish Jagmohan, and Aditya Vempaty

Summary

Large language models (LLMs) have shown promise across a wide range of tasks, but their ability to use external APIs effectively remains a challenge. A new benchmark suite, SEAL (Suite for Evaluating API-use of LLMs), aims to assess this capability thoroughly, uncovering the strengths and weaknesses of current LLMs in real-world API interaction scenarios.

SEAL addresses limitations of existing benchmarks such as ToolBench and APIGen, which suffer from overfitting due to small test sets, a lack of multi-step reasoning tasks, and real-time API instability. SEAL standardizes existing benchmarks into a unified format, incorporates an agent system for API retrieval and planning, and tackles API instability by simulating responses with a GPT-4 powered API simulator backed by caching. This allows for deterministic evaluations and more reliable comparisons. The system evaluates the entire API-use pipeline, from API retrieval and accurate parameter passing to final response quality and correctness.

SEAL's evaluation reveals that while LLMs can perform well on simple API calls, their performance degrades as the API pool grows and queries become more complex. Common errors include incorrect API retrieval, particularly in multi-tool scenarios, and inaccurate parameter passing caused by difficulty extracting information from complex queries. These findings emphasize the need for better retrieval methods, more robust multi-step planning, and a more sophisticated understanding of API documentation by LLMs. SEAL offers a significant advance in evaluating the practical applicability of LLMs in real-world scenarios, paving the way for more focused research and development to enhance their tool-use capabilities.
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.

Questions & Answers

How does SEAL's API simulator and caching system work to address API instability issues?
SEAL uses a GPT-4 powered API simulator combined with caching to create stable, reproducible API testing environments. The system works through three main components: 1) A GPT-4 based simulator that generates consistent API responses based on expected behavior, 2) A caching mechanism that stores and retrieves simulated responses for identical queries, and 3) A standardization layer that ensures uniform response formats. For example, when testing an e-commerce API, instead of relying on potentially unstable live endpoints, SEAL would simulate product search responses and cache them for consistent evaluation across multiple test runs. This approach enables deterministic evaluations and reliable comparisons between different LLM models.
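To make the caching idea concrete, here is a minimal Python sketch of a cache-backed simulator in the spirit of SEAL's design. The `SimulatedAPI` class, its prompt format, and the cache-key scheme are our own illustration, not SEAL's actual code; only the OpenAI chat completions call is a real API.

```python
import hashlib
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


class SimulatedAPI:
    """Cache-backed API simulator sketch, in the spirit of SEAL's design."""

    def __init__(self, model="gpt-4"):
        self.model = model
        self.cache = {}  # maps request fingerprints to simulated responses

    def call(self, api_name, api_doc, params):
        # Fingerprint the request so identical calls hit the cache.
        key = hashlib.sha256(
            json.dumps({"api": api_name, "params": params}, sort_keys=True).encode()
        ).hexdigest()
        if key in self.cache:
            return self.cache[key]

        # On a cache miss, ask the model to act as the API and
        # produce a plausible, well-formed JSON response body.
        prompt = (
            f"You are simulating the API '{api_name}'.\n"
            f"Documentation: {api_doc}\n"
            f"Input parameters: {json.dumps(params)}\n"
            "Return only a realistic JSON response body."
        )
        completion = client.chat.completions.create(
            model=self.model,
            temperature=0,  # low temperature for stable simulated payloads
            messages=[{"role": "user", "content": prompt}],
        )
        response = completion.choices[0].message.content
        self.cache[key] = response
        return response
```

Repeated calls with identical parameters return the cached payload, so an evaluation run produces the same results every time regardless of live-endpoint availability.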
What are the main challenges LLMs face when using APIs in real-world applications?
Large Language Models face several key challenges when interacting with APIs in practical applications. The main difficulties include accurate API selection from large pools of options, proper parameter handling, and managing complex multi-step queries. These challenges affect everyday applications like virtual assistants, automated customer service, and workflow automation tools. For businesses, this means that while LLMs can handle simple API interactions (like basic weather queries), they may struggle with more complex tasks that require multiple API calls or detailed parameter configuration. Understanding these limitations is crucial for organizations planning to implement LLM-based automation solutions.
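To see why API selection degrades with pool size, consider this toy Python sketch (the keyword-overlap retriever and the `api_pool` entries are illustrative assumptions, not SEAL's retrieval method): the model only ever sees the top-k retrieved tools, so a weak retriever can eliminate the correct API before the LLM is even consulted.

```python
def retrieve_apis(query, api_pool, k=3):
    """Rank APIs by naive keyword overlap with the query (toy retriever)."""
    query_terms = set(query.lower().split())

    def score(api):
        doc_terms = set(api["description"].lower().split())
        return len(query_terms & doc_terms)

    return sorted(api_pool, key=score, reverse=True)[:k]


api_pool = [
    {"name": "get_weather", "description": "Current weather for a city"},
    {"name": "search_flights", "description": "Find flights between two cities"},
    {"name": "convert_currency", "description": "Convert an amount between currencies"},
    # ...a realistic pool would contain hundreds more entries
]

candidates = retrieve_apis("How much is 100 USD in euros?", api_pool)
print([api["name"] for api in candidates])
```

As the pool grows, more descriptions share vocabulary with any given query, so shortlisting the right tool becomes the bottleneck before parameter passing even begins.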
What benefits does automated API testing bring to software development?
Automated API testing brings numerous advantages to modern software development processes. It ensures consistent quality checking, reduces human error, and speeds up the development cycle significantly. For businesses, this means faster time-to-market, reduced testing costs, and more reliable applications. Common applications include continuous integration pipelines, regression testing, and performance monitoring. For example, a web development team can automatically verify hundreds of API endpoints in minutes instead of hours of manual testing. This approach is particularly valuable in agile environments where rapid development and deployment are crucial.
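As a concrete illustration, a minimal pytest sketch of automated endpoint checks might look like this (the base URL and endpoint list are placeholders for a hypothetical service):

```python
# test_endpoints.py -- minimal sketch of automated API endpoint checks
import pytest
import requests

BASE_URL = "https://api.example.com"  # placeholder, not a real service

# Each entry pairs a path with the status code it should return.
ENDPOINTS = [
    ("/health", 200),
    ("/products?query=laptop", 200),
    ("/orders/does-not-exist", 404),
]


@pytest.mark.parametrize("path,expected_status", ENDPOINTS)
def test_endpoint_status(path, expected_status):
    """Each endpoint should return its expected status code."""
    response = requests.get(BASE_URL + path, timeout=5)
    assert response.status_code == expected_status
```

Dropped into a CI pipeline, a parametrized suite like this scales to hundreds of endpoints with no extra code per check.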

PromptLayer Features

1. Testing & Evaluation
SEAL's standardized benchmark approach aligns with PromptLayer's testing capabilities for evaluating LLM API interactions.
Implementation Details
• Configure regression tests comparing LLM responses against cached API calls
• Implement scoring metrics for API parameter accuracy (a toy scoring sketch follows this feature block)
• Set up automated testing pipelines for API interaction scenarios
Key Benefits
• Standardized evaluation of LLM API interactions
• Reproducible testing environments
• Automated regression testing for API handling
Potential Improvements
• Add API simulation capabilities
• Expand metric collection for API parameter accuracy
• Implement multi-step API interaction testing
Business Value
Efficiency Gains
Reduces manual testing time by 70% through automated API interaction validation
Cost Savings
Minimizes API usage costs through cached responses and simulation
Quality Improvement
Ensures consistent API handling across LLM versions and updates
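For the parameter-accuracy metric mentioned above, a toy scoring function (our illustration, not a PromptLayer built-in) could compare the parameters an LLM produced against a gold reference:

```python
def parameter_accuracy(expected, predicted):
    """Fraction of expected parameters the model passed with correct values."""
    if not expected:
        return 1.0
    correct = sum(
        1 for name, value in expected.items() if predicted.get(name) == value
    )
    return correct / len(expected)


# Example: one of two parameters matches, so the score is 0.5.
print(parameter_accuracy(
    {"city": "Paris", "units": "metric"},
    {"city": "Paris", "units": "imperial"},
))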
2. Workflow Management
SEAL's multi-step API planning and retrieval system matches PromptLayer's workflow orchestration capabilities.
Implementation Details
• Create reusable templates for API interaction patterns
• Implement version tracking for API calls
• Develop multi-step API workflow orchestration (see the sketch after this feature block)
Key Benefits
• Structured API interaction workflows
• Versioned API call templates
• Coordinated multi-step API operations
Potential Improvements
• Enhanced API documentation integration
• Dynamic workflow adaptation based on API responses
• Advanced error handling for API failures
Business Value
Efficiency Gains
Reduces API workflow development time by 50% through templating
Cost Savings
Optimizes API usage through better orchestration and error handling
Quality Improvement
Increases successful API interactions through structured workflows
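For the multi-step orchestration item above, a minimal sketch (with hypothetical placeholder step functions) shows the core pattern: each step consumes the previous step's output, and a failure short-circuits the chain with a descriptive error.

```python
def run_workflow(steps, initial_input):
    """Run API-calling steps in order, threading each output forward."""
    data = initial_input
    for step in steps:
        try:
            data = step(data)
        except Exception as exc:
            raise RuntimeError(f"Workflow failed at {step.__name__}: {exc}") from exc
    return data


def search_flights(query):
    return {"flight_id": "XY123", "query": query}  # placeholder API call


def book_flight(search_result):
    return {"confirmation": f"booked {search_result['flight_id']}"}


print(run_workflow([search_flights, book_flight], "NYC to SFO"))
```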
