Automatic Question Generation (QG) often produces outputs with critical defects, such as factual hallucinations and answer mismatches. However, existing evaluation methods, including LLM-based evaluators, mainly adopt a black-box, holistic paradigm without explicit error modeling, causing such defects to be overlooked and question quality to be overestimated. To address this issue, we propose ErrEval, a flexible and Error-aware Evaluation framework that enhances QG evaluation through explicit error diagnostics. Specifically, ErrEval reformulates evaluation as a two-stage process of error diagnosis followed by informed scoring. In the first stage, a lightweight, plug-and-play Error Identifier detects and categorizes common errors across structural, linguistic, and content-related aspects. These diagnostic signals are then incorporated as explicit evidence to guide LLM evaluators toward more fine-grained and grounded judgments. Extensive experiments on three benchmarks demonstrate the effectiveness of ErrEval, showing that incorporating explicit diagnostics improves alignment with human judgments. Further analyses confirm that ErrEval effectively mitigates the overestimation of low-quality questions.
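To make the two-stage pipeline concrete, the following is a minimal Python sketch of the interface it implies, assuming a simple error taxonomy over the three aspects named above. All names here (Diagnosis, identify_errors, build_eval_prompt) and the stubbed grounding heuristic are illustrative assumptions, not the paper's actual implementation; in particular, the real Error Identifier is a trained, plug-and-play component rather than this placeholder.

```python
from dataclasses import dataclass

# Hypothetical error taxonomy spanning the three aspects named in the abstract.
ERROR_TYPES = {
    "structural": ["incomplete_question", "malformed_syntax"],
    "linguistic": ["ungrammatical", "ambiguous_phrasing"],
    "content": ["factual_hallucination", "answer_mismatch"],
}

@dataclass
class Diagnosis:
    aspect: str      # "structural", "linguistic", or "content"
    error_type: str  # e.g. "answer_mismatch"
    evidence: str    # span or explanation supporting the detection

def identify_errors(context: str, question: str, answer: str) -> list[Diagnosis]:
    """Stage 1: the Error Identifier detects and categorizes errors.

    Stubbed with a trivial grounding check purely to show the interface;
    the actual identifier is a learned model.
    """
    diagnoses = []
    if answer.lower() not in context.lower():
        diagnoses.append(Diagnosis("content", "answer_mismatch",
                                   "answer span is not grounded in the context"))
    return diagnoses

def build_eval_prompt(context: str, question: str, answer: str,
                      diagnoses: list[Diagnosis]) -> str:
    """Stage 2: fold the diagnostic signals into the LLM evaluator's prompt
    as explicit evidence, so scoring is grounded rather than holistic."""
    findings = "\n".join(f"- [{d.aspect}] {d.error_type}: {d.evidence}"
                         for d in diagnoses) or "- no errors detected"
    return (f"Context: {context}\nQuestion: {question}\nAnswer: {answer}\n"
            f"Detected errors:\n{findings}\n"
            "Rate the question's quality from 1 to 5, "
            "citing the findings above in your justification.")
```

Under this sketch, the evaluator prompt always carries the Stage-1 findings verbatim, which is one plausible way to realize the abstract's claim that diagnostics serve as "explicit evidence" for the LLM judge rather than being folded into an opaque holistic score.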