As large language models (LLMs) advance, reliably evaluating their output becomes more challenging because human evaluation is costly. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x fewer training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider on 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.