Reference Output

A target answer paired with an input in an eval dataset, against which generated output is compared.

What is Reference Output?

Reference output is the target answer paired with an input in an eval dataset, used as the comparison point for a model’s generated output. In practice, it gives teams a concrete standard for checking whether an LLM response matches what was expected.

Understanding Reference Output

In evaluation workflows, a reference output is often called the ground truth, ideal answer, or expected answer. It can be a short string, a structured JSON object, a tool call trace, or a longer response, depending on what the task needs. OpenAI’s eval and grader docs describe this as comparing reference answers to model-generated answers, while LangSmith similarly frames datasets as input and reference output pairs. (platform.openai.com)

Reference outputs matter because they make automated scoring possible. If the task is deterministic, like extraction or classification, the grader can check for exact or near-exact matches. If the task is more open-ended, the reference output can still serve as a guide for semantic similarity, rubric-based judging, or partial credit. The key idea is simple: the model is being measured against a known target, not just judged in the abstract. (platform.openai.com)
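The two grading styles above can be sketched in a few lines of Python. This is an illustrative example, not any particular eval framework’s API: the function names are assumptions, and the similarity measure here is just the standard library’s string-matching ratio standing in for a real semantic-similarity or LLM-judge grader.

```python
# Minimal sketch of grading a generated output against a reference output.
# Function names and the choice of similarity measure are illustrative.
from difflib import SequenceMatcher

def exact_match(generated: str, reference: str) -> bool:
    """Strict grading for deterministic tasks like labels or extraction."""
    return generated.strip().lower() == reference.strip().lower()

def similarity_score(generated: str, reference: str) -> float:
    """Loose grading for open-ended tasks, returning a 0.0-1.0 score."""
    return SequenceMatcher(None, generated.lower(), reference.lower()).ratio()

# Deterministic task: the label must match the reference exactly.
exact_match("positive", "Positive")  # passes despite the case difference

# Open-ended task: partial credit measured against the reference target.
similarity_score("Refunds are available only within 30 days",
                 "No, refunds are only available within 30 days of purchase")
```

In practice a team would swap `similarity_score` for an embedding-based comparison or an LLM judge, but the control flow is the same: every grader takes the generated output and the stored reference output as its two inputs.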

Key aspects of Reference Output include:

  1. Target answer: It represents the expected result for a specific input.
  2. Dataset pairing: It is stored alongside the input so evaluators can compare both sides.
  3. Scoring anchor: It gives graders a baseline for pass, fail, similarity, or rubric scores.
  4. Task-dependent format: It may be text, labels, structured data, or multi-step trajectories.
  5. Regression detection: It helps teams spot when prompt or model changes reduce quality.
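The dataset-pairing and scoring-anchor ideas above amount to storing each input next to its target and looping a grader over the rows. A minimal sketch, assuming a simple list-of-dicts schema (the field names here are illustrative, not a specific tool’s format):

```python
# Illustrative eval dataset: each row pairs an input with its reference
# output. The schema is an assumption for this sketch.
eval_dataset = [
    {
        "input": "Classify the sentiment: 'I love this product!'",
        "reference_output": "positive",            # short label
    },
    {
        "input": "Extract the order ID from: 'My order #A123 is late.'",
        "reference_output": {"order_id": "A123"},  # structured JSON
    },
]

def run_eval(dataset, model_fn, grader_fn):
    """Generate an output for each input and score it against the
    stored reference output for that same row."""
    results = []
    for row in dataset:
        generated = model_fn(row["input"])
        results.append(grader_fn(generated, row["reference_output"]))
    return results
```

Because the inputs and reference outputs live together, the same `run_eval` call can be re-run unchanged against a new model or prompt version, which is what enables the regression detection described above.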

Advantages of Reference Output

  1. Clear success criteria: Teams know exactly what the model is supposed to produce.
  2. Repeatable evaluation: The same input and reference output can be reused across model versions.
  3. Faster iteration: Engineers can test prompt changes without manual review every time.
  4. Better benchmarking: It makes it easier to compare models, prompts, and agent flows fairly.
  5. Auditability: Reviewers can inspect why an output passed or failed against a specific target.

Challenges in Reference Output

  1. Ambiguous tasks: Some prompts have more than one valid answer, which makes a single reference output too narrow.
  2. Maintenance overhead: Reference outputs must be updated when product requirements or schemas change.
  3. Overfitting risk: Teams can optimize for the reference instead of the real user goal.
  4. Formatting sensitivity: Small wording or structure differences can affect scores even when the answer is acceptable.
  5. Coverage gaps: A small set of references may not represent the full range of real-world cases.

Example of Reference Output in Action

Scenario: a support team wants to evaluate a chatbot that answers refund policy questions.

They create an eval row with the input, “Can I get a refund after 30 days?” and the reference output, “No, refunds are only available within 30 days of purchase.” The model’s answer is then checked against that reference output using an exact match, similarity score, or LLM judge depending on how strict the policy should be.

If the model says, “Refunds are available only within 30 days,” the output may receive a passing score even if the wording is different. If it says, “Yes, you can request one anytime,” the comparison fails immediately. That makes the reference output a practical control point for quality.
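The refund example above can be made concrete with a tiny similarity grader. This is a sketch under simplifying assumptions: real systems would use embeddings or an LLM judge, and the 0.6 pass threshold here is an arbitrary illustrative choice.

```python
# Sketch of grading the refund-policy example with a crude token-overlap
# similarity; the threshold and helper name are illustrative assumptions.
reference = "No, refunds are only available within 30 days of purchase."

def token_overlap(generated: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the generated answer."""
    gen = set(generated.lower().replace(",", "").replace(".", "").split())
    ref = set(reference.lower().replace(",", "").replace(".", "").split())
    return len(gen & ref) / len(ref)

passing = "Refunds are available only within 30 days"
failing = "Yes, you can request one anytime"

token_overlap(passing, reference)  # high overlap with the reference -> pass
token_overlap(failing, reference)  # almost no overlap -> fail
```

The differently worded answer shares most of the reference’s tokens and clears a 0.6 threshold, while the contradictory answer shares none of them, mirroring the pass/fail outcomes described above.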

How PromptLayer helps with Reference Output

PromptLayer helps teams store prompts, run evaluations, and compare generated outputs against reference outputs in a repeatable workflow. That makes it easier to track prompt changes, review regressions, and keep evaluation data organized as your app evolves.

Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.
