The continued advancement of large language models (LLMs) has drawn increasing attention to the need for fair and reliable methods of evaluating their performance. In particular, subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, pose significant challenges to reliable LLM evaluation. Because evaluation frameworks often use regular expressions (RegEx) for answer extraction, some models may adjust their responses to fit specific formats that RegEx can extract easily. Nevertheless, the RegEx-based key answer extraction module frequently suffers from extraction errors. This paper presents a comprehensive analysis of the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy, reduces LLMs' reliance on specific answer formats, and enhances the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this work, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. In generalization testing and evaluation in real-world scenarios, the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%, whereas RegEx in the best evaluation framework reaches 74.38%. xFinder thus exhibits stronger robustness and higher accuracy than existing evaluation frameworks. All resources for xFinder are available at \url{https://github.com/IAAR-Shanghai/xFinder}.
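To make the failure mode of RegEx-based extraction concrete, the following is a minimal sketch of how a pattern-based extractor can miss or misread a key answer. The pattern, the function name `extract_with_regex`, and the sample responses are all hypothetical illustrations; they are not taken from xFinder or from any specific evaluation framework.

```python
import re
from typing import Optional

# Hypothetical pattern in the spirit of framework extractors (an assumption,
# not copied from any real framework): look for "answer is X" with X in A-D.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def extract_with_regex(response: str) -> Optional[str]:
    """Return the first option letter matched by ANSWER_PATTERN, or None."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


# Three made-up model responses that all intend the answer "B".
responses = [
    "The answer is (B).",                              # matches the expected format
    "I would choose option B because it fits best.",   # no "answer is" phrase: no match
    "The answer is definitely not A. B is correct.",   # "d" of "definitely" is misread as D
]
gold = ["B", "B", "B"]

extracted = [extract_with_regex(r) for r in responses]
accuracy = sum(e == g for e, g in zip(extracted, gold)) / len(gold)

for response, answer in zip(responses, extracted):
    print(f"{answer!r:6} <- {response}")
print(f"extraction accuracy: {accuracy:.2f}")
```

In this toy setting only the first response is extracted correctly, even though all three responses express the same answer. A measured score therefore depends on output format as well as on correctness, which is the gap a learned extractor is meant to close.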