Automatic question generation is a critical task whose outputs must be judged for quality along dimensions such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluation is costly and impractical for large sets of generated questions. We therefore propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation of questions produced by automatic question generation systems. We experimented with several state-of-the-art LLMs, including GPT-4, Gemini, and Llama2-70b. With the feedback-based MIRROR approach, scores on the human evaluation metrics of relevance, appropriateness, novelty, complexity, and grammaticality improved, moving closer to the human baseline scores. Furthermore, Pearson's correlation coefficient between GPT-4 and human experts improved under MIRROR compared to direct prompting for evaluation. Error analysis shows that MIRROR particularly helps to improve relevance and appropriateness.
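The multi-LLM review-and-response loop can be sketched as follows. The abstract does not specify the exact protocol, so the rater/reviewer role split, the stopping rule, the 1-5 score format, and all function names here are assumptions for illustration; the LLM calls are stubbed so the sketch runs without API access.

```python
# Hedged sketch of a feedback-based rating loop in the spirit of MIRROR.
# The rater/reviewer roles, stopping rule, and score format are assumptions,
# not the paper's specified protocol.

METRICS = ["relevance", "appropriateness", "novelty", "complexity", "grammaticality"]

def rate(question, context, feedback, model):
    """Placeholder for an LLM call returning 1-5 scores per metric.
    A real implementation would prompt `model` with the question, the
    source context, and any reviewer feedback from the previous round."""
    return {m: 3 for m in METRICS}  # stubbed scores so the sketch runs offline

def review(question, scores, model):
    """Placeholder for a second LLM that critiques the ratings and either
    approves them or returns textual feedback for another round."""
    return None  # None = approved; a string would trigger another iteration

def mirror_evaluate(question, context, rater="gpt-4", reviewer="gemini", max_rounds=3):
    """Iteratively rate and review until the reviewer accepts or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        scores = rate(question, context, feedback, rater)
        feedback = review(question, scores, reviewer)
        if feedback is None:  # reviewer accepts the current ratings
            break
    return scores

print(mirror_evaluate("What causes seasons on Earth?", context="..."))
```

The design choice of separating the rater from the reviewer lets disagreement between models surface as explicit feedback rather than a single model's unexamined score.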
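The agreement measure reported above can be reproduced as a simple Pearson correlation between an LLM judge's scores and human expert scores over a set of questions. The score lists below are hypothetical examples, not data from the paper.

```python
# Pearson's r between LLM-judge scores and human expert scores.
# The rating lists are hypothetical illustrations, not the paper's data.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical 1-5 ratings on the "relevance" metric for six questions.
human_scores = [5, 4, 3, 4, 2, 5]
llm_scores = [5, 4, 2, 4, 3, 5]
print(f"Pearson r (relevance): {pearson_r(human_scores, llm_scores):.3f}")
```

In practice one would compute r per metric (relevance, appropriateness, novelty, complexity, grammaticality) and compare the direct-prompting and MIRROR conditions.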