While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy search or beam search, it introduces a challenge we refer to as metric bias. Because MBR decoding aims to produce translations that score highly according to a specific utility metric, that same metric can no longer be trusted for both decoding and evaluation: apparent improvements may simply be due to reward hacking rather than real gains in quality. In this work we find that, compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but also overestimate the quality of MBR and quality-estimation (QE) decoding that uses other neural utility metrics. We also show that this metric bias can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding with an ensemble of utility metrics outperforms decoding with any single utility metric.
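The decoding scheme described above can be sketched as follows: each candidate translation is scored against every other candidate (treated as a pseudo-reference) under several utility metrics, and the per-metric expected utilities are averaged into an ensemble score. This is a minimal illustration, not the paper's implementation; the toy `token_f1` and `length_ratio` utilities stand in for neural metrics such as COMET or MetricX, and the uniform average is one of several possible ensembling choices.

```python
from typing import Callable, Sequence

# A utility metric scores a hypothesis against a (pseudo-)reference.
Utility = Callable[[str, str], float]

def mbr_decode(candidates: Sequence[str], utilities: Sequence[Utility]) -> str:
    """MBR decoding with an ensemble of utility metrics.

    Each candidate's expected utility is estimated by averaging its score
    against the whole candidate pool (used as samples from the model);
    the ensemble score is the uniform mean over metrics.
    """
    best, best_score = candidates[0], float("-inf")
    for hyp in candidates:
        per_metric = [
            sum(u(hyp, ref) for ref in candidates) / len(candidates)
            for u in utilities
        ]
        score = sum(per_metric) / len(per_metric)  # uniform ensemble average
        if score > best_score:
            best, best_score = hyp, score
    return best

# Hypothetical stand-in utilities; a real system would call neural
# metrics (e.g. COMET, MetricX) here instead.
def token_f1(hyp: str, ref: str) -> float:
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = len(set(h) & set(r))
    p, q = overlap / len(h), overlap / len(r)
    return 2 * p * q / (p + q) if p + q else 0.0

def length_ratio(hyp: str, ref: str) -> float:
    return min(len(hyp), len(ref)) / max(len(hyp), len(ref), 1)

cands = ["the cat sat on the mat", "a cat sat on a mat", "cat mat"]
print(mbr_decode(cands, [token_f1, length_ratio]))
```

Using two different utilities means no single metric fully determines the winner, which is the intuition behind the ensembling mitigation: a candidate that exploits the quirks of one metric is penalized by the others.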