Published: Oct 2, 2024
Updated: Oct 2, 2024

Who Should Judge AI? Experts vs. Users vs. AI Itself

Comparing Criteria Development Across Domain Experts, Lay Users, and Models in Large Language Model Evaluation
By Annalisa Szymanski, Simret Araya Gebreegziabher, Oghenemaro Anuyah, Ronald A. Metoyer, and Toby Jia-Jun Li

Summary

Imagine you're a dietitian or a math teacher. How do you know whether an AI gives good advice? This research explores exactly that by comparing how domain experts, everyday users, and AI models themselves judge the quality of AI-generated answers. It turns out they don't always agree. Experts are laser-focused on the details, picking up on subtle errors or missing information that others might overlook. Regular users care more about whether the advice is clear, easy to follow, and presented in a way that makes sense to them. And the AI? It tends to follow instructions very literally, sometimes missing the bigger picture.

The study reveals a telling dynamic: experts write highly specific criteria grounded in deep domain knowledge, users focus on usability, and AI tends to latch onto keywords, sometimes oversimplifying the problem. This gap highlights a challenge: how do we make sure AI is not just smart, but also useful and safe? The answer lies in a multi-stage approach where experts, users, and AI each play a part. Experts set the gold standard based on scientific understanding, users ensure the advice is practical and clear, and AI automates the simpler checks, saving everyone time. The study concludes that by combining these perspectives, we can refine how we evaluate AI, leading to higher-quality, safer, and more accessible advice in specialized fields like nutrition and education.
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.

Question & Answers

What is the multi-stage approach for evaluating AI systems mentioned in the research?
The multi-stage evaluation approach combines three distinct perspectives: expert assessment, user feedback, and automated AI checks. The process works by having experts establish baseline quality standards using their domain knowledge, users validate practical usability and clarity, and AI systems perform automated preliminary checks. For example, in a nutrition advisory system, experts would verify scientific accuracy of dietary recommendations, users would assess if the advice is actionable and clear, and AI could automatically flag responses that deviate from established guidelines. This layered approach ensures comprehensive quality control while balancing accuracy, usability, and efficiency.
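To make that layered process concrete, here is a minimal Python sketch of how such a pipeline might be wired up. It is not code from the paper: the required keywords, the placeholder review functions, and the exact staging are illustrative assumptions.

```python
# Minimal sketch (not from the paper) of a multi-stage evaluation pipeline.
# Criteria names, keywords, and placeholder review functions are assumptions.
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    response: str
    passed_auto_check: bool = False
    user_scores: dict = field(default_factory=dict)
    expert_scores: dict = field(default_factory=dict)


def auto_check(response: str, required_keywords: list[str]) -> bool:
    """Cheap automated screen: flag responses missing required guideline terms."""
    return all(k.lower() in response.lower() for k in required_keywords)


def user_review(response: str) -> dict:
    """Stand-in for lay-user ratings of clarity and actionability (1-5 scale)."""
    return {"clarity": 4, "actionability": 5}  # collected from real users in practice


def expert_review(response: str) -> dict:
    """Stand-in for expert ratings of accuracy and completeness (1-5 scale)."""
    return {"accuracy": 5, "completeness": 4}  # collected from dietitians/teachers in practice


def evaluate(response: str) -> Evaluation:
    ev = Evaluation(response)
    # Stage 1: automated screening catches obviously deficient answers early.
    ev.passed_auto_check = auto_check(response, required_keywords=["fiber", "portion"])
    if not ev.passed_auto_check:
        return ev  # skip costly human review for clearly deficient answers
    # Stage 2: lay users judge usability; Stage 3: experts judge correctness.
    ev.user_scores = user_review(response)
    ev.expert_scores = expert_review(response)
    return ev


print(evaluate("Aim for consistent portion sizes and add fiber-rich vegetables."))
```

Running the automated screen first is the key design choice: human reviewers only ever see responses that already clear the basic checks, which is where the time and cost savings come from.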
How does AI evaluation impact everyday decision-making?
AI evaluation helps ensure that automated advice and recommendations are both accurate and practical for daily use. It creates a safety net by combining expert knowledge with real-world usability testing, making AI systems more reliable for everyday decisions. For instance, when using AI-powered apps for diet planning or educational support, proper evaluation ensures you receive advice that's not just technically correct but also easy to understand and implement. This matters because it helps people make better-informed decisions with confidence, whether they're planning meals, learning new subjects, or getting professional advice.
What are the benefits of including multiple perspectives in AI assessment?
Including diverse perspectives in AI assessment leads to more balanced and effective AI systems. Expert input ensures technical accuracy and safety, while user feedback guarantees practical usefulness and accessibility. This comprehensive approach helps create AI solutions that are both sophisticated and user-friendly. For example, in educational applications, having teachers, students, and AI working together leads to learning tools that are academically sound, engaging, and easy to use. This collaborative evaluation method helps bridge the gap between technical excellence and practical utility, resulting in AI systems that better serve their intended purpose.

PromptLayer Features

1. Testing & Evaluation
The paper's multi-perspective evaluation approach aligns with comprehensive testing needs for AI outputs.
Implementation Details
Create separate testing pipelines for expert validation, user feedback, and automated AI checks, with defined success metrics for each group (see the sketch below this feature's Business Value notes)
Key Benefits
• Comprehensive quality assessment across multiple perspectives
• Structured evaluation framework for specialized domains
• Automated validation for basic quality checks
Potential Improvements
• Add domain-specific evaluation templates
• Implement weighted scoring systems
• Develop expert review integration tools
Business Value
Efficiency Gains
Reduces manual review time by 60% through automated initial screening
Cost Savings
Decreases expert review costs by prioritizing only complex cases
Quality Improvement
Ensures consistent quality across all AI outputs through standardized evaluation
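As a rough illustration of the implementation note above (separate pipelines with defined success metrics for each group), the sketch below keeps one threshold set per pipeline. The metric names and thresholds are assumptions for illustration; this does not use PromptLayer's SDK.

```python
# Sketch of three separate evaluation pipelines, each with its own success
# metrics. Metric names and thresholds are illustrative assumptions.
PIPELINES: dict[str, dict[str, float]] = {
    "automated": {"guideline_keyword_coverage": 0.9, "max_reading_grade_level": 10.0},
    "user_feedback": {"clarity_mean": 4.0, "actionability_mean": 4.0},
    "expert_validation": {"accuracy_mean": 4.5, "completeness_mean": 4.0},
}


def passes(pipeline: str, observed: dict[str, float]) -> bool:
    """Check observed scores against the pipeline's success thresholds."""
    thresholds = PIPELINES[pipeline]
    # Metrics prefixed with "max_" are upper bounds; everything else is a lower bound.
    return all(
        observed.get(name, 0.0) <= target if name.startswith("max_")
        else observed.get(name, 0.0) >= target
        for name, target in thresholds.items()
    )


print(passes("expert_validation", {"accuracy_mean": 4.7, "completeness_mean": 4.2}))  # True
print(passes("user_feedback", {"clarity_mean": 3.2, "actionability_mean": 4.1}))      # False
```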
2. Workflow Management
The multi-stage evaluation process requires orchestrated workflows combining expert, user, and AI assessment stages.
Implementation Details
Design sequential workflow templates that coordinate expert review, user testing, and automated AI validation (see the sketch below this feature's Business Value notes)
Key Benefits
• Streamlined evaluation process
• Consistent quality control steps
• Traceable assessment history
Potential Improvements
• Add parallel review capabilities
• Implement feedback loops
• Create domain-specific templates
Business Value
Efficiency Gains
Reduces evaluation cycle time by 40% through structured workflows
Cost Savings
Optimizes resource allocation across different evaluation stages
Quality Improvement
Ensures no evaluation steps are missed through systematic process management
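The sequential workflow described in this feature's implementation note could be orchestrated roughly as follows. This is a hedged sketch, not an actual product workflow: the stage names, the simple pass/fail gating, and the stand-in checks are assumptions.

```python
# Sketch of a sequential evaluation workflow with a traceable, timestamped history.
# Stage names and the pass/fail gating are illustrative assumptions.
from datetime import datetime, timezone
from typing import Callable


def run_workflow(response: str, stages: list[tuple[str, Callable[[str], bool]]]) -> list[dict]:
    """Run stages in order, recording a timestamped result for each one."""
    history = []
    for name, check in stages:
        passed = check(response)
        history.append({
            "stage": name,
            "passed": passed,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        if not passed:  # stop early so later, costlier stages only see viable outputs
            break
    return history


stages = [
    ("automated_screen", lambda r: len(r.split()) > 10),         # cheap length/format check
    ("user_usability_review", lambda r: True),                   # stand-in for collected user ratings
    ("expert_accuracy_review", lambda r: "fiber" in r.lower()),  # stand-in for expert sign-off
]

print(run_workflow(
    "Swap refined grains for whole grains and add fiber-rich vegetables "
    "to keep portions satisfying across the week.",
    stages,
))
```

The recorded history provides the traceable assessment record mentioned under Key Benefits, and the early exit ensures expert review is spent only on outputs that passed the earlier stages.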
