Published
Jul 21, 2024
Updated
Jul 21, 2024

Can AI Grade Your Physics Exams?

Achieving Human Level Partial Credit Grading of Written Responses to Physics Conceptual Question using GPT-3.5 with Only Prompt Engineering
By
Zhongzhou Chen and Tong Wan

Summary

Grading student work is a time-consuming task, especially in STEM fields like physics. What if AI could help? Researchers at the University of Central Florida explored using GPT-3.5 to grade student answers to a physics conceptual question. They didn't retrain the model; instead, they used a clever prompting technique called "scaffolded chain of thought." This approach provides GPT-3.5 with a detailed rubric and guides it to compare student answers to specific criteria step by step. The results were impressive. The AI grader, using scaffolded prompting, achieved 70-80% agreement with human graders – comparable to the level of agreement *between* two human graders. This suggests that with the right prompting, AI could potentially handle grading tasks with human-level accuracy, freeing up instructors' time for other important work. The next step is to expand this technique to other STEM fields, like engineering, where evaluating complex solutions to open-ended problems is a major part of education. While this initial study focused on a single physics problem, it opens the door for exciting possibilities in automated assessment, potentially transforming how educators evaluate and provide feedback to students. Further research will need to examine the reproducibility of these findings and investigate the effectiveness of this approach on a wider range of problems.
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.

Questions & Answers

How does the scaffolded chain of thought prompting technique work in AI grading systems?
The scaffolded chain of thought technique provides AI with a structured approach to grading by breaking down the evaluation process into distinct steps. It works by first giving GPT-3.5 a detailed rubric, then guiding it through a systematic comparison of student answers against specific criteria. For example, when grading a physics problem, the AI would: 1) Review the rubric requirements, 2) Analyze the student's answer component by component, 3) Compare each component against rubric criteria, and 4) Assign grades based on met criteria. This methodical approach achieved 70-80% agreement with human graders, matching the consistency level between human graders themselves.
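To make that workflow concrete, here is a minimal sketch of how such a scaffolded grading prompt could be assembled and sent to GPT-3.5 using the openai Python SDK. The rubric items, prompt wording, and model settings below are illustrative assumptions, not the exact prompt used in the study.

```python
# A minimal sketch of scaffolded chain-of-thought grading with the OpenAI API.
# The rubric items and prompt wording are illustrative assumptions, not the
# study's actual prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

RUBRIC = [
    "Identifies that the net force on the object is zero",
    "States that the object therefore moves at constant velocity",
    "Justifies the answer using Newton's first law",
]

def grade_response(question: str, student_answer: str) -> str:
    """Ask the model to check the answer against each rubric item in turn."""
    rubric_text = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(RUBRIC))
    prompt = (
        f"Question:\n{question}\n\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        "For each rubric item, quote the relevant part of the student answer, "
        "state whether the item is satisfied, and explain why. "
        "Then report the total number of items satisfied as the score."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a careful physics grader."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # keep grading as repeatable as possible
    )
    return response.choices[0].message.content
```

Keeping the temperature at 0 makes the grading as repeatable as possible across runs, which matters when comparing AI grades against human grades.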
What are the potential benefits of AI grading systems in education?
AI grading systems offer several key advantages in educational settings. They can significantly reduce teachers' workload by automating time-consuming assessment tasks, allowing educators to focus more on instruction and student interaction. These systems provide consistent evaluation criteria across all submissions, eliminating potential human bias and fatigue-related errors. For students, AI grading can offer immediate feedback, enabling faster learning cycles. The technology is particularly valuable in STEM fields where problems often have complex, multi-step solutions that require detailed evaluation.
How might AI transform assessment methods in different academic subjects?
AI is poised to revolutionize assessment methods across various academic disciplines. Beyond just grading multiple-choice tests, modern AI can evaluate complex written responses, mathematical proofs, and even creative work. It can provide instant feedback on writing style, logical coherence, and technical accuracy. The technology's ability to analyze patterns and maintain consistency makes it valuable for subjects ranging from literature to engineering. This transformation could lead to more frequent assessments, personalized feedback, and adaptive learning paths that adjust to individual student needs and learning styles.

PromptLayer Features

1. Prompt Management
The study's scaffolded chain-of-thought prompting technique requires careful prompt versioning and standardization.
Implementation Details
Create versioned prompt templates with rubric integration, establish standardized grading criteria blocks, and implement role-based access for different graders (see the template sketch after this block).
Key Benefits
• Consistent grading criteria across multiple assessments
• Traceable prompt evolution and improvements
• Collaborative refinement of grading prompts
Potential Improvements
• Auto-generated rubric integration
• Subject-specific prompt templates
• Multi-language prompt support
Business Value
Efficiency Gains
50% reduction in prompt development time through reusable templates
Cost Savings
Reduced need for multiple human graders and standardization meetings
Quality Improvement
More consistent grading across different evaluators and assignments
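As a concrete illustration of the versioned, rubric-integrated templates described above, here is a hypothetical sketch in plain Python. It is not the PromptLayer SDK; the template name, version number, and rubric items are invented for the example.

```python
# Hypothetical illustration of a versioned grading-prompt template with an
# embedded rubric; generic Python, not the PromptLayer SDK itself.
from dataclasses import dataclass

@dataclass(frozen=True)
class GradingPromptTemplate:
    name: str
    version: int
    rubric: tuple[str, ...]
    instructions: str = (
        "Compare the student answer to each rubric item, one at a time, "
        "and justify whether the item is satisfied before assigning credit."
    )

    def render(self, question: str, student_answer: str) -> str:
        """Fill the template with a specific question and student answer."""
        rubric_text = "\n".join(f"- {item}" for item in self.rubric)
        return (
            f"[{self.name} v{self.version}]\n"
            f"Question:\n{question}\n\n"
            f"Rubric:\n{rubric_text}\n\n"
            f"Student answer:\n{student_answer}\n\n"
            f"{self.instructions}"
        )

# Example of a later revision of the grading prompt.
TEMPLATE_V2 = GradingPromptTemplate(
    name="physics-conceptual-grader",
    version=2,
    rubric=(
        "Identifies the relevant physical principle",
        "Applies the principle correctly to this scenario",
        "States the final conclusion clearly",
    ),
)
```

Bumping the version number whenever the rubric or instructions change keeps every grading run traceable to the exact prompt text that produced it.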
2. Testing & Evaluation
The research compared AI grading accuracy with human graders, requiring systematic evaluation methods.
Implementation Details
Set up A/B testing between different prompt versions, implement regression testing against human-graded samples, and create scoring metrics for grading accuracy (see the agreement-check sketch after this block).
Key Benefits
• Quantifiable grading accuracy measurements
• Systematic prompt performance tracking
• Early detection of grading inconsistencies
Potential Improvements
• Automated accuracy threshold alerts
• Integration with educational benchmarks
• Real-time grading validation
Business Value
Efficiency Gains
75% faster prompt optimization through automated testing
Cost Savings
Reduced need for manual verification of AI grading accuracy
Quality Improvement
Higher confidence in AI grading through continuous validation
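To show what regression testing against human-graded samples might look like in practice, here is a minimal sketch that checks AI-human agreement on a held-out set. The sample scores and the 70% threshold are illustrative assumptions informed by the 70-80% agreement reported in the paper, and grade_response() refers to the hypothetical grader sketched earlier.

```python
# Minimal sketch of a regression check for a grading prompt: compare AI grades
# against a held-out set of human-graded responses. Scores and threshold are
# illustrative assumptions.

def percent_agreement(ai_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of responses where the AI grade exactly matches the human grade."""
    assert len(ai_scores) == len(human_scores)
    matches = sum(a == h for a, h in zip(ai_scores, human_scores))
    return matches / len(ai_scores)

# Hypothetical held-out set of responses already graded by humans.
human_scores = [3, 2, 0, 1, 3, 2, 2, 1, 0, 3]
ai_scores    = [3, 2, 1, 1, 3, 2, 1, 1, 0, 3]  # e.g. produced by grade_response()

agreement = percent_agreement(ai_scores, human_scores)
print(f"AI-human agreement: {agreement:.0%}")

# Flag a new prompt version that falls below the level of agreement typically
# seen between two human graders (roughly 70% in the study).
if agreement < 0.70:
    raise SystemExit("New prompt version underperforms the human-agreement baseline")
```

Running this check automatically whenever a prompt is revised catches regressions before a weaker grading prompt reaches students.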
