Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.

