Evaluating the instruction-following capability of Large Language Models (LLMs) has relied heavily on a powerful LLM as the judge, which introduces unresolved biases that cause judgments to deviate from those of human judges. In this work, we reevaluate various choices for automatic evaluation on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they enhance the reliability of automatic evaluations across a wide range of tasks, yielding up to a 3.2% improvement in agreement with human judges. We also find that human-written responses offer a perspective orthogonal to that of model-generated responses and should be used as additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF), comprising 4,258 samples across 11 task categories and employing a composite evaluation setup that selects the most reliable method for each category. In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. Finally, we study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template. We host a live leaderboard that evaluates LLMs on the private evaluation set of HREF.