Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these issues, we propose to use soft EM with entity-driven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
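The idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the substring-based soft-EM criterion, and the expansion patterns (e.g. surname-only forms for person entities) are assumptions chosen for clarity; the actual entity-type-specific patterns are richer.

```python
def normalize(s: str) -> str:
    # Lowercase and collapse whitespace before matching.
    return " ".join(s.lower().split())

def soft_em(prediction: str, gold_answers: set[str]) -> bool:
    # Soft EM (one common formulation): the prediction is correct if any
    # gold surface form appears as a substring of the normalized prediction.
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)

def expand_gold_set(answer: str, entity_type: str) -> set[str]:
    # Hypothetical entity-driven expansion: add surface forms that
    # typically co-occur for the given entity type.
    forms = {answer}
    if entity_type == "PERSON":
        parts = answer.split()
        if len(parts) >= 2:
            forms.add(parts[-1])  # e.g. surname alone as a valid form
    return forms

# Usage: expand the gold set once, then score predictions against it.
gold = expand_gold_set("Barack Obama", "PERSON")
print(soft_em("The 44th president was Obama.", gold))  # True
```

Because every match is a visible string comparison against a fixed answer set, each scoring decision can be inspected directly, which is the interpretability advantage over an LLM judge.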