Pass@k

A code-evaluation metric measuring the probability that at least one of k generated solutions passes all unit tests.

What is Pass@k?

Pass@k is a code evaluation metric that measures the chance that at least one of k generated solutions passes the test suite. In practice, it is widely used for code generation because a model can fail on one sample and still succeed across multiple tries. (arxiv.org)

Understanding Pass@k

Pass@k comes from functional correctness benchmarking, especially code tasks where outputs can be executed against unit tests. The core idea is simple: generate k candidates, run them through the tests, and count the task as solved if any candidate passes. That makes it a good fit for stochastic models, where one sample may miss but another may be correct. (arxiv.org)
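The procedure above can be sketched in a few lines of Python. The `generate` and `passes_tests` functions here are hypothetical stand-ins for a model sampler and a unit-test runner, not part of any real API:

```python
def solved_at_k(prompt, k, generate, passes_tests):
    """Return True if any of k sampled candidates passes the unit tests."""
    return any(passes_tests(generate(prompt)) for _ in range(k))

# Demo with deterministic stand-ins: the first two candidates are buggy,
# the third is correct, so the task counts as solved at k=3.
candidates = iter(["return s[::-1", "return reversed(s)", "return s[::-1]"])
demo_generate = lambda prompt: next(candidates)
demo_passes = lambda code: code == "return s[::-1]"
print(solved_at_k("reverse a string", 3, demo_generate, demo_passes))  # True
```

Note that `any()` short-circuits: in a real harness you would usually still run all k candidates so the counts can feed an unbiased pass@k estimate.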

In the original Codex and HumanEval work, repeated sampling was shown to improve measured performance substantially, which helped make pass@k a standard reporting format for code benchmarks. In evaluation workflows, it complements other metrics by capturing the value of search, sampling, and reranking rather than only the first answer. Key aspects of Pass@k include:

  1. Multiple samples: the model gets k attempts on the same problem.
  2. Binary success rule: if any sample passes the tests, the task counts as solved.
  3. Functional focus: it checks whether code works, not whether it merely looks plausible.
  4. Sampling sensitivity: the score is non-decreasing in k, since more attempts can only add chances to hit a correct solution.
  5. Benchmark friendly: it is easy to report across models and prompts on the same dataset.
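Because naively sampling exactly k candidates gives a high-variance estimate, the Codex paper proposed an unbiased estimator: draw n ≥ k samples, count the c that pass, and compute 1 − C(n−c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    given n total samples of which c passed the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 samples drawn, 3 passed: estimated chance that
# at least one of k=5 samples would pass
print(round(pass_at_k(10, 3, 5), 4))  # → 0.9167
```

Averaging this per-task estimate over the benchmark gives the reported pass@k score.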

Advantages of Pass@k

  1. Captures search quality: it rewards systems that can explore multiple candidate solutions.
  2. Matches code generation reality: many coding workflows already use best-of-n sampling or reranking.
  3. Easy to interpret: teams can quickly understand what k attempts buy in success rate.
  4. Works with unit tests: it maps cleanly to the way software correctness is often verified.
  5. Useful for comparison: it gives a common benchmark across models, prompts, and decoding settings.

Challenges in Pass@k

  1. Depends on test quality: weak unit tests can overstate true correctness.
  2. Hides single-shot behavior: a strong pass@k can mask poor pass@1 performance.
  3. Costs more compute: higher k means more generations and more test runs.
  4. Not always comparable: results depend on sampling temperature, stopping rules, and candidate selection.
  5. Can encourage over-sampling: teams may optimize for more attempts instead of better first answers.

Example of Pass@k in Action

Scenario: a team is evaluating a code model on 200 programming tasks. For each task, they ask the model for 10 solutions and run every solution against the unit tests.

If at least one of the 10 candidates passes for a task, that task counts as a success for pass@10. If the model solves 146 of the 200 tasks that way, then pass@10 is 73%.
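The arithmetic in this scenario is just a success rate over tasks. A toy version, with the 146/200 outcome hard-coded for illustration:

```python
# Hypothetical results: for each of 200 tasks, whether any of the 10
# candidates passed the unit tests (146 tasks solved at least once).
solved = [True] * 146 + [False] * 54

pass_at_10 = sum(solved) / len(solved)
print(f"pass@10 = {pass_at_10:.0%}")  # → pass@10 = 73%
```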

That number tells the team something useful: the model may not be reliable on its first try, but it can often recover with more samples. In a PromptLayer workflow, you can track those runs, compare prompt variants, and see which changes improve functional success across repeated generations.

How PromptLayer helps with Pass@k

PromptLayer helps teams organize code-generation experiments, compare prompts, and track evaluation results over time. That makes it easier to measure pass@k alongside other metrics, spot regressions, and understand whether improvements come from the prompt, decoding settings, or downstream tooling.

Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.
