This study focuses on the evaluation of the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, indicating that human evaluation remains the most reliable approach. We introduce a new task, Evaluating QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. We measure the performance of automatic evaluation methods against human-annotated results. Specifically, we investigate which methods correlate highly with human evaluations and deem those methods more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe the QA-Eval task and the EVOUNA dataset will facilitate the development of more effective automatic evaluation tools and prove valuable for future research in this area. All resources are available at \url{https://github.com/wangcunxiang/QA-Eval} under the Apache-2.0 License.