When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities such as coherence, readability, and alignment with human expectations. However, human evaluations are costly, even for large tech companies, and when conducted with active users they may degrade the user experience. A promising alternative is synthetic feedback, where evaluations are performed by other LLMs, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations while keeping win-rate estimates unbiased. Our experiments demonstrate a reduction in human annotations of up to 12.2% with an off-the-shelf synthetic evaluator and up to 24.8% with a finetuned variant. Besides being generalizable, scalable, and free of hyperparameter tuning, our method offers predictable annotation savings that can be estimated from data-dependent characteristics.
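The idea of combining a small pool of human labels with abundant synthetic labels while keeping the win-rate estimate unbiased can be illustrated with a standard control-variate (prediction-powered) estimator. The sketch below is only an illustration of that general statistical recipe, not necessarily the paper's exact method; all quantities (the simulated human and synthetic judgments, the sizes `N` and `n`) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: N prompt pairs are judged by a synthetic evaluator
# (binary win labels), but only a random subset of n also receives a
# human label. The data here is simulated purely for illustration.
N, n = 10_000, 1_000
human_all = rng.binomial(1, 0.6, size=N)  # latent human judgments (unobserved in practice)
# A biased synthetic evaluator: disagrees with humans on ~10% of pairs.
synthetic = np.where(rng.random(N) < 0.1, 1 - human_all, human_all)

labeled = rng.choice(N, size=n, replace=False)  # pairs sent to human annotators

# Control-variate / prediction-powered estimator:
#   win_rate_hat = mean(synthetic over all N) + mean(human - synthetic over labeled n)
# The cheap synthetic labels drive down variance, while the correction
# term, computed on the human-labeled subset, removes the synthetic
# evaluator's bias, so the estimate targets the human win rate.
naive_synthetic = synthetic.mean()
correction = (human_all[labeled] - synthetic[labeled]).mean()
win_rate_hat = naive_synthetic + correction
```

Because the correction term is an unbiased estimate of the synthetic evaluator's systematic error, the combined estimator stays unbiased for the human win rate regardless of how biased the synthetic judge is; a better (e.g. finetuned) synthetic judge only shrinks the variance, which is what allows fewer human annotations for the same precision.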