Large language models (LLMs) have garnered significant attention for their unprecedented performance, prompting a growing body of research on evaluating them. However, existing evaluation benchmarks are largely limited to assessing instruction-following capabilities, overlooking the fundamental abilities that emerge during the pre-training stage. Previous subjective evaluation methods mainly rely on scoring by API models; yet in the absence of references, large models have shown limited ability to discern subtle differences among responses. To bridge this gap, we propose F-Eval, a bilingual benchmark for evaluating fundamental abilities across three dimensions: expression, commonsense, and logic. The tasks in F-Eval comprise multiple-choice objective tasks, open-ended objective tasks, reference-based subjective tasks, and reference-free subjective tasks. For the reference-free subjective tasks, we devise new evaluation methods that serve as alternatives to scoring by API models. We conduct evaluations on 13 advanced LLMs. Results show that our evaluation methods achieve higher correlation coefficients and greater distinction than other evaluators. Additionally, we discuss the influence of different model sizes, evaluation dimensions, and normalization methods. We anticipate that F-Eval will facilitate the study of LLMs' fundamental abilities.