Evaluating the instruction-following capability of Large Language Models (LLMs) has relied heavily on a powerful LLM as the judge, which introduces unresolved biases that cause judgments to deviate from those of human judges. In this work, we reevaluate various choices for automatic evaluation on a wide range of instruction-following tasks. We experiment with methods that leverage human-written responses and observe that they enhance the reliability of automatic evaluations across a wide range of tasks, yielding up to a 3.2% improvement in agreement with human judges. We also find that human-written responses offer a perspective orthogonal to that of model-generated responses and should be used as additional context when comparing model responses. Based on these observations, we develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF), comprising 4,258 samples across 11 task categories and employing a composite evaluation setup that selects the most reliable method for each category. In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. Finally, we study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template. We host a live leaderboard that evaluates LLMs on the private evaluation set of HREF.