SWE-bench

A benchmark built from real GitHub issues that evaluates whether coding agents can resolve them, with success verified by tests.

What is SWE-bench?

SWE-bench is a benchmark for evaluating coding agents on real GitHub issues. In practice, it gives a model a repository and an issue description, then checks whether the model's proposed patch actually resolves the problem, using test-backed verification. (github.com)

Understanding SWE-bench

SWE-bench was designed to measure repository-level software engineering, not just code completion. The original benchmark includes 2,294 problems drawn from real GitHub issues and their corresponding pull requests across 12 popular Python repositories. That setup makes it useful for testing whether an agent can understand a codebase, identify the right files, and produce a change that fits existing behavior. (arxiv.org)
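
Concretely, each task instance pairs an issue with the repository state it was filed against and the tests that define success. The sketch below shows that shape; the field names follow the publicly released dataset, while the values are placeholders rather than a real task.

```python
# Illustrative SWE-bench task instance (field names per the public dataset,
# placeholder values).
task_instance = {
    "instance_id": "example__repo-1234",        # unique task identifier
    "repo": "example/repo",                     # GitHub repository the issue came from
    "base_commit": "abc123",                    # commit the agent's checkout starts from
    "problem_statement": "f(x) raises TypeError for valid input ...",
    "patch": "<gold patch from the linked pull request>",
    "test_patch": "<tests added or changed by that pull request>",
    "FAIL_TO_PASS": ["tests/test_f.py::test_valid_input"],    # must fail before the fix, pass after
    "PASS_TO_PASS": ["tests/test_f.py::test_other_behavior"], # must keep passing
}
```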

The benchmark is grounded in execution. A candidate fix is judged by running tests that should fail before the fix and pass afterward, while also checking that previously passing tests still work. A later release, SWE-bench Verified, is a human-validated subset of 500 samples with improved test quality and issue clarity, and the benchmark also ships a containerized evaluation harness for more reproducible runs. (openai.com)
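
Stated as code, the scoring rule is simple. The sketch below assumes each task lists its fail-to-pass and pass-to-pass tests (as the public dataset does) and that the harness reports a pass/fail result per test after the candidate patch is applied; the function name and signature are illustrative, not the harness's actual API.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Illustrative SWE-bench-style decision rule for one task instance.

    test_results maps a test identifier to True (passed) or False (failed),
    measured after the candidate patch has been applied.
    """
    # The tests that reproduce the issue must now pass...
    fixes_bug = all(test_results.get(t, False) for t in fail_to_pass)
    # ...and tests that passed before the patch must still pass.
    no_regressions = all(test_results.get(t, False) for t in pass_to_pass)
    return fixes_bug and no_regressions
```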

Key aspects of SWE-bench include:

  1. Real issues: Tasks come from actual GitHub bug reports, not synthetic prompts.
  2. Repository context: Agents work inside the target codebase and must reason across files.
  3. Test-based scoring: Success is judged by whether the patch passes the relevant tests.
  4. Reproducible harness: Docker-based evaluation helps standardize runs across systems; a sketch of the submission format the harness scores appears after this list.
  5. Verified subsets: Curated variants reduce ambiguous or flawed tasks for cleaner evaluation.
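
For reference, the open-source evaluation harness scores agent submissions from a predictions file (JSON or JSONL at the time of writing), with one entry per task instance. The sketch below shows that shape under the assumption that the field names still match the public harness documentation; the values are placeholders.

```python
# Illustrative predictions entry consumed by the SWE-bench evaluation harness.
# Field names follow the public harness docs; values are placeholders.
prediction = {
    "instance_id": "example__repo-1234",      # which task instance this patch targets
    "model_name_or_path": "my-coding-agent",  # label used to group and report results
    "model_patch": "diff --git a/example.py b/example.py\n...",  # unified diff to apply
}
```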

Advantages of SWE-bench

  1. More realistic evaluation: It measures work that looks like actual engineering, not toy coding drills.
  2. Objective outcomes: Passing tests gives a clear success signal.
  3. Good for agent comparison: Teams can compare scaffolds, models, and workflows on the same tasks.
  4. Useful for regression tracking: It helps show whether a new agent change really improves end-to-end repair ability.
  5. Encourages better tooling: Because tasks are hard, it pushes teams toward stronger retrieval, planning, and execution loops.

Challenges in SWE-bench

  1. Ambiguous issues: Some GitHub issues leave too much unsaid for a clean benchmark task.
  2. Test brittleness: A test can reject a correct fix if it is overly specific.
  3. Environment setup: Reproducing the right runtime can be a source of failures unrelated to the patch.
  4. Long context demands: Agents often need to inspect many files and traces at once.
  5. Benchmark contamination: Public benchmark data can be indirectly memorized by models trained on web text.

Example of SWE-bench in action

Scenario: a coding agent is given a bug report from an open-source Python repository saying a function fails for a specific input.

The agent searches the codebase, finds the relevant module, edits the logic, and runs the project tests. If the failing test now passes and unrelated tests still pass, the submission counts as a successful SWE-bench-style fix.
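
As a sketch, that loop looks roughly like the code below. The three callables are hypothetical stand-ins for the agent's real code-search, editing, and test-execution tools, passed in as parameters so the control flow stays concrete; it does not reflect any specific agent framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TestReport:
    failing_test_now_passes: bool  # the issue's reproducing tests pass after the edit
    broke_other_tests: bool        # previously passing tests regressed

def attempt_fix(
    issue_text: str,
    repo_path: str,
    search_codebase: Callable[[str, str], list[str]],  # hypothetical code-search tool
    propose_edit: Callable[[str, list[str]], str],     # hypothetical patch-drafting step
    run_tests: Callable[[str, str], TestReport],       # hypothetical test-execution tool
    max_iterations: int = 5,
) -> Optional[str]:
    """Iterate until a candidate patch fixes the bug without regressions."""
    for _ in range(max_iterations):
        files = search_codebase(repo_path, issue_text)  # locate likely-relevant modules
        patch = propose_edit(issue_text, files)         # draft a candidate change as a diff
        report = run_tests(repo_path, patch)            # apply the patch and run the suite
        if report.failing_test_now_passes and not report.broke_other_tests:
            return patch                                # a verified, test-backed fix
    return None                                         # no verified fix within the budget
```

In a real run, the test step would execute inside the containerized environment built for that task, and scoring would use the task's fail-to-pass and pass-to-pass test lists as described above.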

That is why SWE-bench is useful for teams building agents with tool use, code search, and iterative repair loops. It rewards the full workflow, not just a clever one-shot answer.

How PromptLayer helps with SWE-bench

PromptLayer helps teams manage the prompts, traces, and evaluations that sit behind coding agents evaluated on SWE-bench. If you are tuning an agent to plan fixes, call tools, and verify patches, PromptLayer gives you a place to compare prompt versions and inspect what changed across runs.

Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.
