Automatic question generation is a critical task, and evaluating the quality of generated questions involves factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack; human evaluation, however, is costly and impractical for large-scale samples of generated questions. We therefore propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation of questions produced by automatic question generation systems. We experimented with several state-of-the-art LLMs, including GPT-4, Gemini, and Llama2-70b, and observed that scores on the human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved under the feedback-based MIRROR approach, moving closer to the human baseline scores. Furthermore, Pearson's correlation coefficient between GPT-4 and human experts improved when using MIRROR compared to direct prompting for evaluation. Error analysis shows that MIRROR significantly helps to improve relevance and appropriateness.
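The abstract describes the pipeline only at a high level. As a rough illustration, the sketch below shows one plausible reading of a multi-LLM iterative review-and-response loop and of measuring agreement with human raters via Pearson's correlation. The helper names (query_llm, parse_scores, mirror_rate, agreement_with_humans), the prompt wording, the number of feedback rounds, and the mean aggregation are assumptions for illustration, not the authors' implementation.

```python
"""Minimal, hypothetical sketch of a MIRROR-style multi-LLM feedback loop.
`query_llm` is a caller-supplied chat-completion function; `parse_scores`
is a naive placeholder parser. Neither reflects the paper's actual code."""
import re
from statistics import mean
from scipy.stats import pearsonr

METRICS = ["relevance", "appropriateness", "novelty", "complexity", "grammaticality"]

def parse_scores(response: str) -> dict:
    # Naive parser: expects lines like "relevance: 4" in the model's reply.
    scores = {}
    for metric in METRICS:
        match = re.search(rf"{metric}\s*[:=]\s*([1-5])", response, re.IGNORECASE)
        if match:
            scores[metric] = int(match.group(1))
    return scores

def mirror_rate(question, context, query_llm, models, rounds=2):
    """Each model rates the question, then iteratively reviews and revises its rating."""
    per_model = []
    for model in models:
        prompt = (f"Context:\n{context}\n\nQuestion:\n{question}\n\n"
                  f"Rate the question 1-5 on {', '.join(METRICS)} with justification.")
        response = query_llm(model, prompt)            # initial rating + rationale
        for _ in range(rounds):                        # feedback/review rounds
            response = query_llm(
                model,
                f"Previous rating:\n{response}\n\nCritique it and return revised "
                f"1-5 scores for {', '.join(METRICS)}.")
        per_model.append(parse_scores(response))
    # Aggregate across models with a simple per-metric mean.
    return {m: mean(s[m] for s in per_model if m in s) for m in METRICS}

def agreement_with_humans(llm_scores, human_scores):
    """Pearson's r between per-question LLM scores and human expert scores."""
    r, _ = pearsonr(llm_scores, human_scores)
    return r
```

In this reading, direct prompting corresponds to rounds=0 (a single rating with no review pass), and the reported correlation improvement would be computed per metric with agreement_with_humans over matched LLM and human score lists.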