BERTScore
A semantic similarity metric that compares generated and reference text using contextual BERT embeddings.
What is BERTScore?
BERTScore is a semantic similarity metric that compares generated text against a reference using contextual BERT embeddings. Instead of relying on exact word overlap, it measures how closely the meaning of the two texts aligns (Zhang et al., 2020).
Understanding BERTScore
In practice, BERTScore tokenizes the candidate and reference texts, embeds each token with a pretrained transformer model, and then greedily matches tokens by cosine similarity. The result is usually reported as precision, recall, and F1, which makes it useful for tasks like summarization, translation, and open-ended generation, where paraphrases can be correct even when they share few exact words.
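To make that concrete, here is a minimal sketch of the core computation using the Hugging Face transformers library: embed both texts, build a token-to-token cosine similarity matrix, and greedily match each token to its best counterpart. The encoder name is just an example choice; the official implementation adds refinements such as IDF weighting, layer selection, and baseline rescaling.

```python
# Minimal sketch of BERTScore's greedy-matching core.
# Assumes `pip install torch transformers`; the encoder choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # example encoder; official defaults differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_tokens(text: str) -> torch.Tensor:
    """Contextual embeddings for each token, excluding [CLS]/[SEP]."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[1:-1]

def simple_bertscore(candidate: str, reference: str) -> tuple[float, float, float]:
    cand = torch.nn.functional.normalize(embed_tokens(candidate), dim=-1)
    ref = torch.nn.functional.normalize(embed_tokens(reference), dim=-1)
    sim = cand @ ref.T  # cosine similarity for every candidate/reference token pair

    precision = sim.max(dim=1).values.mean().item()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean().item()     # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```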
For teams evaluating LLM outputs, BERTScore sits between brittle string-match metrics and fully human review. It is not a full judgment of factuality or instruction following, but it gives a useful signal for semantic closeness, especially when you want an automatic metric that tracks meaning better than n-gram overlap. The PromptLayer team often treats metrics like this as one part of a broader evaluation stack.
Key aspects of BERTScore include:
- Contextual embeddings: It uses transformer-based representations, so word meaning depends on surrounding text.
- Similarity matching: Candidate and reference tokens are paired by embedding similarity rather than exact string overlap.
- Precision, recall, and F1: These scores help you inspect whether outputs are concise, complete, or balanced.
- Reference-based evaluation: It works best when you have a gold answer or target text to compare against.
- Paraphrase sensitivity: It can reward outputs that preserve meaning even when wording differs.
Advantages of BERTScore
- Semantic awareness: It captures meaning more naturally than exact-match metrics.
- Paraphrase friendly: It can score valid rewordings highly.
- Simple to automate: It fits cleanly into batch evaluation pipelines.
- Granular feedback: Precision, recall, and F1 help you diagnose different output patterns.
- Widely used: It is a familiar baseline for text generation evaluation.
Challenges in BERTScore
- Model dependency: Scores can change based on the underlying encoder choice (see the sketch after this list).
- Reference dependence: It is less useful when no high-quality reference exists.
- Not a truth checker: High semantic similarity does not guarantee factual accuracy.
- Compute cost: Embedding every token with a transformer is heavier than computing token-overlap metrics.
- Interpretation gaps: A good score does not always map cleanly to human preference.
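The model-dependency caveat is easy to see directly: scoring the same candidate and reference pair under two different encoders with the open-source bert-score package will typically yield different numbers. Treat the snippet below as a sketch based on that package's public API; baseline rescaling only works when a precomputed baseline file exists for the chosen model and language.

```python
# Sketch: score the same pair under two encoders to see model dependency.
# Assumes `pip install bert-score`; argument names follow that package's API.
from bert_score import score

cands = ["A full refund is available if you cancel within 30 days."]
refs = ["You can cancel within 30 days for a full refund."]

for model_type in ["bert-base-uncased", "microsoft/deberta-xlarge-mnli"]:
    # rescale_with_baseline spreads raw cosine scores over a wider range,
    # provided the package ships a baseline for this model and language.
    P, R, F1 = score(cands, refs, model_type=model_type, lang="en",
                     rescale_with_baseline=True, verbose=False)
    print(f"{model_type}: F1={F1[0]:.3f}")
```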
Example of BERTScore in Action
Scenario: a team is evaluating a customer-support assistant that rewrites short policy answers.
If the reference says, “You can cancel within 30 days for a full refund,” and the model returns, “A full refund is available if you cancel within 30 days,” BERTScore should rate the output highly because the meaning is preserved even though the wording changed.
That makes it useful in regression tests for generation systems. The team can track whether newer prompts produce outputs that stay semantically aligned with the reference set, then review low-scoring cases by hand to see whether the issue is wording, omission, or a real meaning error.
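A lightweight version of that regression check might look like the sketch below, again using the bert-score package. The threshold, field names, and second test case are illustrative assumptions rather than recommended values; teams usually calibrate the cutoff against a handful of hand-reviewed outputs, and raw (unrescaled) scores tend to cluster near the top of the range.

```python
# Sketch of a regression check: score a batch of outputs against references
# and flag low-F1 cases for manual review. Data and threshold are illustrative.
from bert_score import score

test_cases = [
    {
        "reference": "You can cancel within 30 days for a full refund.",
        "output": "A full refund is available if you cancel within 30 days.",
    },
    {
        "reference": "You can cancel within 30 days for a full refund.",
        "output": "Refunds are only available for annual plans.",  # hypothetical regression
    },
]

refs = [case["reference"] for case in test_cases]
cands = [case["output"] for case in test_cases]
P, R, F1 = score(cands, refs, lang="en", verbose=False)

REVIEW_THRESHOLD = 0.90  # illustrative cutoff; calibrate on hand-reviewed examples
for case, f1 in zip(test_cases, F1.tolist()):
    flag = "REVIEW" if f1 < REVIEW_THRESHOLD else "ok"
    print(f"[{flag}] F1={f1:.3f}  {case['output']}")
```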
How PromptLayer helps with BERTScore
PromptLayer helps teams store prompt versions, run evaluations, and compare outputs over time, which makes it easier to pair BERTScore with other checks like rubric-based grading and human review. That gives you a clearer picture of whether a prompt change improved semantic quality, not just surface form.
Ready to try it yourself? Sign up for PromptLayer and start managing your prompts in minutes.