Automatic question generation is a critical task whose outputs must be judged for quality along dimensions such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluation is costly and impractical for large sets of generated questions. We therefore propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation of questions produced by automatic question generation systems. We experimented with several state-of-the-art LLMs, including GPT-4, Gemini, and Llama2-70b. With the feedback-based MIRROR approach, scores on the human evaluation metrics of relevance, appropriateness, novelty, complexity, and grammaticality improved, moving closer to the human baseline scores. Furthermore, Pearson's correlation coefficient between GPT-4 and human experts improved under MIRROR compared to direct prompting for evaluation. Error analysis shows that MIRROR particularly helps to improve relevance and appropriateness.
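The multi-LLM review-and-response loop can be sketched as follows. The abstract does not specify the exact protocol, so the rater/reviewer role split, the stopping rule, the 1-5 score format, and all function names here are assumptions for illustration; the LLM calls are stubbed so the sketch runs without API access.

```python
# Hedged sketch of a feedback-based rating loop in the spirit of MIRROR.
# The rater/reviewer roles, stopping rule, and score format are assumptions,
# not the paper's specified protocol.

METRICS = ["relevance", "appropriateness", "novelty", "complexity", "grammaticality"]

def rate(question, context, feedback, model):
    """Placeholder for an LLM call returning 1-5 scores per metric.
    A real implementation would prompt `model` with the question, the
    source context, and any reviewer feedback from the previous round."""
    return {m: 3 for m in METRICS}  # stubbed scores so the sketch runs offline

def review(question, scores, model):
    """Placeholder for a second LLM that critiques the ratings and either
    approves them or returns textual feedback for another round."""
    return None  # None = approved; a string would trigger another iteration

def mirror_evaluate(question, context, rater="gpt-4", reviewer="gemini", max_rounds=3):
    """Iteratively rate and review until the reviewer accepts or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        scores = rate(question, context, feedback, rater)
        feedback = review(question, scores, reviewer)
        if feedback is None:  # reviewer accepts the current ratings
            break
    return scores

print(mirror_evaluate("What causes seasons on Earth?", context="..."))
```

The design choice of separating the rater from the reviewer lets disagreement between models surface as explicit feedback rather than a single model's unexamined score.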
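The agreement measure reported above can be reproduced as a simple Pearson correlation between an LLM judge's scores and human expert scores over a set of questions. The score lists below are hypothetical examples, not data from the paper.

```python
# Pearson's r between LLM-judge scores and human expert scores.
# The rating lists are hypothetical illustrations, not the paper's data.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings on the "relevance" metric for six questions.
human_scores = [5, 4, 3, 4, 2, 5]
llm_scores = [5, 4, 2, 4, 3, 5]
print(f"Pearson r (relevance): {pearson_r(human_scores, llm_scores):.3f}")
```

In practice one would compute r per metric (relevance, appropriateness, novelty, complexity, grammaticality) and compare the direct-prompting and MIRROR conditions.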