As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system, with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data were sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed the other templates, with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad hoc prompt engineering toward evidence-based prompt development for educational applications.
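To make the pairwise win probabilities concrete: under the Glicko-2 system, the expected score of one rated item against another follows a standard formula that dampens the difference in ratings by the opponent's rating deviation. The sketch below illustrates that formula only; the ratings and deviations are hypothetical and do not reproduce the study's actual tournament pipeline or results.

```python
import math

# Conversion constant between the Glicko scale (centered at 1500)
# and the internal Glicko-2 scale (centered at 0).
GLICKO2_SCALE = 173.7178


def g(phi: float) -> float:
    """Dampening factor based on the opponent's rating deviation phi."""
    return 1.0 / math.sqrt(1.0 + 3.0 * phi**2 / math.pi**2)


def expected_win(r_a: float, r_b: float, rd_b: float) -> float:
    """Expected score (win probability) of prompt A against prompt B.

    r_a, r_b  -- Glicko ratings of the two prompts
    rd_b      -- rating deviation of the opponent (prompt B)
    Per Glicko-2, only the opponent's deviation enters the expectation.
    """
    mu_a = (r_a - 1500.0) / GLICKO2_SCALE
    mu_b = (r_b - 1500.0) / GLICKO2_SCALE
    phi_b = rd_b / GLICKO2_SCALE
    return 1.0 / (1.0 + math.exp(-g(phi_b) * (mu_a - mu_b)))


# Hypothetical post-tournament ratings for two prompt templates:
p = expected_win(r_a=1700.0, r_b=1450.0, rd_b=70.0)
```

With these illustrative numbers, the higher-rated prompt is expected to win roughly 80% of judged comparisons; equal ratings yield exactly 0.5.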