Published: Jul 30, 2024
Updated: Sep 15, 2024

Can AI Write Good Test Questions?

Comparison of Large Language Models for Generating Contextually Relevant Questions
By Ivo Lodovico Molina, Valdemar Švábenský, Tsubasa Minematsu, Li Chen, Fumiya Okubo, Atsushi Shimada

Summary

Generating effective test questions is a crucial part of education. It's a time-consuming task for educators, and ensuring questions effectively assess student understanding is a constant challenge. Could AI help? A new study explored just that, comparing three large language models (LLMs), GPT-3.5 Turbo, Flan T5 XXL, and Llama 2-Chat 13B, on their ability to generate relevant questions from university slide text. Researchers used a two-step process: first, one LLM identified key concepts ('answers') within the slides; then, all three LLMs created questions based on those answers. Student volunteers evaluated the generated questions across five key criteria: clarity, relevance, difficulty, connection to the slide content, and how well the question aligned with the provided answer.

The results? GPT-3.5 and Llama 2-Chat 13B performed slightly better than Flan T5, especially in clarity and question-answer alignment, with GPT-3.5 demonstrating a particular knack for matching questions to answers.

While all models showed promise, some weaknesses emerged. Aligning questions precisely with answers proved tricky, and the models sometimes generated overly generic questions. Another constraint is cost: while open-source models like Llama 2 offer a free alternative, the best-performing model, GPT-3.5, comes with usage fees, which could limit wider adoption in educational settings.

Despite these limitations, the study highlights the potential of LLMs for personalized learning support, such as quick quizzes and knowledge reinforcement. For high-stakes assessments like formal exams, however, further refinement is needed to ensure accuracy and fairness. Future research could explore fine-tuning these models to improve their question-generation skills and address the identified limitations. The study's findings offer a glimpse into a future where AI could significantly reduce the burden on educators while providing students with more engaging and personalized learning experiences.
🍰 Interested in building your own agents?
PromptLayer provides the tools to manage and monitor prompts with your whole team. Get started for free.

Questions & Answers

What was the two-step process used by researchers to generate test questions using LLMs?
The researchers employed a sequential two-step process where first, one LLM identified key concepts (answers) from university slide content. Then, three different LLMs (GPT-3.5 Turbo, Flan T5 XXL, and Llama 2-Chat 13B) generated questions based on those identified answers. This approach ensured consistency in concept identification while allowing comparison of question generation capabilities across models. In practice, this could be implemented in an educational setting by first processing course materials through an LLM to extract key learning points, then using another LLM to create various assessment questions targeting those specific concepts.
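Below is a minimal Python sketch of that two-step flow. The call_llm helper, the prompt wording, and the model identifiers are all assumptions for illustration; the paper's exact prompts and API plumbing are not reproduced here.

```python
# A minimal sketch of the paper's two-step flow. call_llm is a hypothetical
# helper; wire it to whichever chat API you use. Prompt wording and model
# names below are illustrative, not the paper's exact prompts.

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send the prompt to your LLM provider and return its text."""
    raise NotImplementedError("Connect to OpenAI, Hugging Face, etc.")

def extract_answers(slide_text: str, extractor_model: str = "gpt-3.5-turbo") -> list[str]:
    """Step 1: one LLM identifies key concepts ('answers') in the slide text."""
    prompt = ("List the key concepts a student should learn from these slides, "
              "one per line:\n\n" + slide_text)
    return [line.strip() for line in call_llm(extractor_model, prompt).splitlines()
            if line.strip()]

def generate_questions(slide_text: str, answers: list[str],
                       models: list[str]) -> dict[str, list[str]]:
    """Step 2: each candidate model writes one question per extracted answer."""
    return {
        model: [call_llm(model, f"Slide text:\n{slide_text}\n\n"
                                f"Write one clear test question whose answer is: {answer}")
                for answer in answers]
        for model in models
    }

# Usage, once call_llm is wired up:
# answers = extract_answers(slide_text)
# questions = generate_questions(slide_text, answers,
#                                models=["gpt-3.5-turbo", "flan-t5-xxl", "llama-2-13b-chat"])
```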
How can AI help teachers save time in creating educational content?
AI can significantly reduce the time teachers spend on preparation tasks like creating test questions and quizzes. By automating the generation of assessment materials, teachers can focus more on actual teaching and student interaction. The technology can quickly process course materials and create relevant questions, providing a starting point that teachers can then refine. This is particularly valuable for creating practice questions, quick knowledge checks, and preliminary assessment materials. While AI-generated content may need human review, especially for high-stakes testing, it can dramatically streamline the content creation process and provide more time for meaningful educational activities.
What are the main advantages and limitations of using AI for educational assessment?
The main advantages of using AI for educational assessment include time savings for educators, the ability to quickly generate personalized learning materials, and consistent question creation across different topics. However, there are notable limitations: AI models sometimes generate overly generic questions, may struggle with precise question-answer alignment, and can involve significant costs for premium models like GPT-3.5. While AI shows promise for creating practice materials and quick assessments, it's not yet reliable enough for high-stakes testing without human oversight. The technology is best used as a supplementary tool to enhance, rather than replace, traditional assessment methods.

PromptLayer Features

1. Testing & Evaluation

The paper's structured evaluation methodology aligns with PromptLayer's testing capabilities for comparing multiple LLM outputs.
Implementation Details
Set up a batch testing pipeline that compares each model's question generation, implement a scoring system based on the five evaluation criteria, and track performance metrics across versions, as in the sketch below.
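As a rough illustration, here is one way such a scoring system could be structured in Python. The 1-to-5 scale and the rate() stub are assumptions; in the study, ratings came from student volunteers rather than an automated judge.

```python
# A rough sketch of batch scoring on the paper's five criteria. The 1-5 scale
# and the rate() stub are assumptions for illustration; in the study, ratings
# came from student volunteers, not an automated judge.

from statistics import mean

CRITERIA = ["clarity", "relevance", "difficulty", "slide_relation", "qa_alignment"]

def rate(question: str, answer: str, criterion: str) -> int:
    """Placeholder: return a 1-5 rating from a human annotator or an LLM judge."""
    raise NotImplementedError

def score_model(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Average each criterion over all (question, answer) pairs for one model."""
    return {c: mean(rate(q, a, c) for q, a in pairs) for c in CRITERIA}

# Compare models side by side once rate() is implemented:
# scores = {model: score_model(pairs) for model, pairs in generated.items()}
```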
Key Benefits
• Systematic comparison of multiple LLM outputs
• Standardized evaluation across models
• Quantifiable performance tracking
Potential Improvements
• Automated evaluation metrics
• Integration with human feedback loops
• Custom scoring templates for education context
Business Value
Efficiency Gains
Reduced time in manual evaluation of generated questions
Cost Savings
Optimized model selection based on performance/cost ratio
Quality Improvement
Consistent quality assessment across question generation
2. Workflow Management

The two-step question generation process maps to PromptLayer's multi-step orchestration capabilities.
Implementation Details
Create a reusable template for the concept-extraction step, chain it with the question-generation step, and version-control both steps, as sketched below.
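One hypothetical way to structure versioned, reusable templates in plain Python, with an in-memory registry standing in for a real template store (e.g., PromptLayer's prompt registry); names, version tags, and wording are assumptions for illustration:

```python
# A hypothetical in-memory template registry standing in for a real template
# store. Template names, versions, and wording are assumptions.

TEMPLATES = {
    ("concept-extraction", "v1"):
        "List the key concepts a student should learn from these slides:\n{slides}",
    ("question-generation", "v1"):
        "Slides:\n{slides}\n\nWrite one test question whose answer is: {answer}",
}

def render(name: str, version: str, **fields: str) -> str:
    """Fetch a template by (name, version) and fill in its fields."""
    return TEMPLATES[(name, version)].format(**fields)

# Chaining the two steps (call_llm as in the earlier sketch):
# concepts = call_llm("gpt-3.5-turbo", render("concept-extraction", "v1", slides=slide_text))
# for answer in concepts.splitlines():
#     question = call_llm("gpt-3.5-turbo",
#                         render("question-generation", "v1",
#                                slides=slide_text, answer=answer))
```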
Key Benefits
• Reproducible question generation pipeline
• Trackable process iterations
• Modular workflow design
Potential Improvements
• Enhanced error handling between steps
• Dynamic template adjustment
• Automated quality checks
Business Value
Efficiency Gains
Streamlined question generation process
Cost Savings
Reduced development time through reusable components
Quality Improvement
Consistent output quality through standardized workflow
