The continuous advancement of large language models (LLMs) has drawn increasing attention to the problem of evaluating their performance fairly and reliably. In particular, subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, pose significant challenges to reliable LLM evaluation. Because evaluation frameworks often use regular expressions (RegEx) for answer extraction, some models may adjust their responses to match specific formats that RegEx can extract easily. Even so, RegEx-based key answer extraction modules frequently produce extraction errors. This paper analyzes the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy, reduces LLMs' reliance on specific answer formats, and enhances the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this work, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to support effective model training and evaluation. In generalization testing and evaluation in real-world scenarios, the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. xFinder is both more robust and more accurate than existing evaluation frameworks.
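To illustrate the kind of extraction error described above, the following is a minimal sketch of RegEx-based answer extraction for a multiple-choice question. The pattern, the function name `extract_choice`, and the sample responses are all hypothetical; they are not taken from any particular evaluation framework or from xFinder itself.

```python
import re
from typing import Optional

# Hypothetical extraction rule (an assumption for illustration): look for
# the phrase "answer is" followed by a choice letter, optionally in parentheses.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the choice letter found by the pattern, or None if it does not match."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


# Three made-up responses that all intend the answer "B".
responses = [
    "The answer is (B).",                        # matches the expected format
    "I would go with option B, since it fits.",  # correct answer, unexpected format
    "The answer is definitely B.",               # the "d" in "definitely" is captured
]

for text in responses:
    print(repr(text), "->", extract_choice(text))
```

Only the first response is extracted correctly. The second yields no answer at all, and the third yields the wrong letter, so a model that answers correctly but phrases its answer differently would be scored as wrong under this rule.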