Large language models (LLMs) have garnered significant attention for their unprecedented performance, leading to a growing number of studies that evaluate them. However, existing evaluation benchmarks are largely limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models, yet in the absence of references, large models have shown limited ability to discern subtle differences. To bridge this gap, we propose F-Eval, a bilingual evaluation benchmark for fundamental abilities, including expression, commonsense, and logic. The tasks in F-Eval include multiple-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks. For reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models. We evaluate 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and distinguish between models more clearly than other evaluators. Additionally, we discuss the influence of model size, evaluation dimension, and normalization method. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.