Recent advances in large language models (LLMs) show the potential of using LLMs as evaluators for assessing the quality of text generated by LLMs. However, naively applying LLM evaluators to compare or judge different systems can yield unreliable results due to the intrinsic win rate estimation bias of LLM evaluators. To mitigate this problem, we propose two calibration methods, Bayesian Win Rate Sampling (BWRS) and Bayesian Dawid-Skene, both of which leverage Bayesian inference to more accurately infer the true win rate of generative language models. We empirically validate our methods on six datasets covering story generation, summarization, and instruction-following tasks. We show that both methods are effective in improving the accuracy of win rate estimation using LLMs as evaluators, offering a promising direction for reliable automatic text quality evaluation.
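To make the calibration idea concrete, below is a minimal sketch (not the paper's implementation) of Bayesian win rate calibration in the spirit of BWRS. It assumes a symmetric error model in which the evaluator agrees with the human verdict with a single accuracy a, and all counts (n_agree, n_total, llm_wins, n_pairs) are hypothetical placeholders. A Beta posterior over the evaluator's accuracy is fit on a small human-labeled set, a Beta posterior over the observed win rate is fit on the LLM evaluator's verdicts, and draws from both are combined to invert the bias and recover a posterior over the true win rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small human-labeled calibration set: how often the LLM evaluator's
# verdict agreed with the human verdict (hypothetical counts).
n_agree, n_total = 85, 100

# LLM evaluator's verdicts on the full evaluation set: 1 = system A wins
# (hypothetical counts).
llm_wins, n_pairs = 640, 1000

samples = []
for _ in range(10_000):
    # Posterior over evaluator accuracy a (Beta-Bernoulli, uniform prior).
    a = rng.beta(1 + n_agree, 1 + (n_total - n_agree))
    # Posterior over the observed (biased) win rate z.
    z = rng.beta(1 + llm_wins, 1 + (n_pairs - llm_wins))
    # Under a symmetric error model, z = a*theta + (1 - a)*(1 - theta),
    # so the true win rate is theta = (z + a - 1) / (2a - 1).
    if a <= 0.5:
        continue  # evaluator no better than chance; inversion is undefined
    theta = (z + a - 1) / (2 * a - 1)
    samples.append(min(max(theta, 0.0), 1.0))  # clip to the valid range [0, 1]

print(f"calibrated win rate: {np.mean(samples):.3f} "
      f"(95% CI {np.percentile(samples, 2.5):.3f}-{np.percentile(samples, 97.5):.3f})")
```

Under this simple model, an evaluator at chance accuracy (a = 0.5) carries no information about the true win rate, which is why the inversion is skipped there; the Dawid-Skene family of models, by contrast, estimates per-evaluator confusion matrices jointly with the latent labels rather than assuming a single symmetric accuracy.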