Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Source: arXiv. 12 pages. Workshop paper, The 24th International Conference on Artificial Intelligence in Education (AIED 2023). Keywords: Educational Dialogue Act Classification, Large Language Models, Tutor Training

Research suggests that specific and timely feedback enhances the performance of human tutors. Providing such feedback is challenging, however, because having human evaluators assess tutor performance is time-consuming. Large language models, such as the AI chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. Nevertheless, the accuracy of AI-generated feedback remains uncertain, and little research has investigated whether models like ChatGPT can deliver effective feedback. In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a tutor-student setting. We use two prompting approaches, zero-shot chain of thought and few-shot chain of thought, to identify specific components of effective praise based on five criteria. The results of each approach are then compared with those of human graders for accuracy. Our goal is to assess the extent to which GPT-4 can accurately identify each praise criterion. We found that the zero-shot and few-shot chain of thought approaches yield comparable results. GPT-4 performs moderately well in identifying instances when the tutor offers specific and immediate praise. However, GPT-4 underperforms in identifying the tutor's ability to deliver sincere praise, particularly in the zero-shot prompting scenario, where examples of sincere tutor praise statements were not provided. Future work will focus on enhancing prompt engineering, developing a more general tutoring rubric, and evaluating our method using real-life tutoring dialogues.
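
The comparison against human graders described above can be pictured with a minimal sketch. It assumes that each dialogue receives one yes/no label per praise criterion from both the human graders and the model, and that accuracy is the fraction of dialogues on which the two agree. The function name, the dialogue ids, and every label below are hypothetical illustrations, not data or code from the paper, and only the three criteria named in the abstract (specific, immediate, sincere) appear.

```python
from collections import defaultdict


def per_criterion_agreement(human, model):
    """Return {criterion: fraction of dialogues where the model label equals the human label}.

    Both arguments map dialogue id -> {criterion name: bool}.
    This is an assumed scoring scheme, not necessarily the one used in the paper.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for dialogue_id, human_labels in human.items():
        model_labels = model[dialogue_id]
        for criterion, human_value in human_labels.items():
            totals[criterion] += 1
            if model_labels[criterion] == human_value:
                matches[criterion] += 1
    return {criterion: matches[criterion] / totals[criterion] for criterion in totals}


# Invented toy labels for four dialogues (illustration only, not study results).
human = {
    "d1": {"specific": True, "immediate": True, "sincere": True},
    "d2": {"specific": False, "immediate": True, "sincere": True},
    "d3": {"specific": True, "immediate": False, "sincere": False},
    "d4": {"specific": False, "immediate": False, "sincere": True},
}
model = {
    "d1": {"specific": True, "immediate": True, "sincere": False},
    "d2": {"specific": False, "immediate": True, "sincere": True},
    "d3": {"specific": True, "immediate": True, "sincere": False},
    "d4": {"specific": False, "immediate": False, "sincere": False},
}

scores = per_criterion_agreement(human, model)
for criterion, score in scores.items():
    print(f"{criterion}: {score:.2f}")
```

Running the same function once on labels from the zero-shot prompt and once on labels from the few-shot prompt would give the per-criterion comparison between the two approaches. A chance-corrected statistic such as Cohen's kappa could replace raw agreement if label imbalance is a concern.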
