Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics remain underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases. First, QE metrics consistently over-predict errors as translation length increases, even for high-quality, error-free texts. Second, they systematically prefer shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. For learned QE metrics, we trace the root cause of these biases to skewed supervision distributions, in which longer error-free examples are underrepresented in the training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.
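The first bias described above (predicted errors rising with length even on error-free translations) suggests a simple check that could be run on any QE metric. The following is a minimal sketch, not the procedure used in the study: the function names `pearson_and_slope` and `length_bias_report`, the record layout, and the choice of Pearson correlation plus a least-squares slope are all illustrative assumptions.

```python
from statistics import mean


def pearson_and_slope(xs, ys):
    """Return (Pearson r, least-squares slope of ys on xs)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / (var_x * var_y) ** 0.5, cov / var_x


def length_bias_report(records):
    """Summarize how a QE metric's predicted error score varies with length.

    `records` is a hypothetical list of (tokens, predicted_error_score) pairs,
    restricted to translations that human annotators judged error-free.
    For such translations an unbiased metric should give r and slope near 0.
    """
    lengths = [len(tokens) for tokens, _ in records]
    scores = [score for _, score in records]
    r, slope = pearson_and_slope(lengths, scores)
    return {"pearson_r": r, "slope_per_token": slope}


# Toy data (fabricated for illustration only): the predicted error score
# grows linearly with length although every translation is error-free.
toy_records = [(["tok"] * n, 0.1 * n) for n in (5, 10, 20, 40)]
report = length_bias_report(toy_records)
```

On real data, a clearly positive `pearson_r` and `slope_per_token` on error-free translations would be consistent with the over-prediction pattern described above, while values near zero would indicate that predicted errors are not tied to length.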