Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing MT quality estimation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQRs), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters using policy rewards derived from the DA scores and TQRs. Integrating error-aware rewards into ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English-to-Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
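To make the reward design described above concrete, the following is a minimal sketch of how a policy reward combining DA-score accuracy with TQR-grounded error awareness might look. All names, the score-parsing regex, the word-overlap heuristic, and the `alpha` weighting are illustrative assumptions for exposition, not the exact ALOPE-RL implementation.

```python
# Hypothetical sketch of an error-aware policy reward in the spirit of ALOPE-RL.
# The abstract states only that rewards are derived from the DA score and the
# TQR; the specific formulation below is an assumption, not the paper's method.

import re
from dataclasses import dataclass


@dataclass
class QEAnnotation:
    da_score: float  # human Direct Assessment score in [0, 100]
    tqr: str         # free-form Translation Quality Remark


def parse_predicted_score(response: str) -> float | None:
    """Extract a numeric quality score from the model's generated response."""
    match = re.search(r"score\s*[:=]?\s*(\d{1,3}(?:\.\d+)?)", response, re.IGNORECASE)
    if match:
        score = float(match.group(1))
        if 0.0 <= score <= 100.0:
            return score
    return None


def da_reward(predicted: float | None, gold: float) -> float:
    """Reward in [0, 1] that decays linearly with the absolute DA-score error."""
    if predicted is None:
        return 0.0  # an unparsable output earns no score reward
    return max(0.0, 1.0 - abs(predicted - gold) / 100.0)


def tqr_reward(response: str, tqr: str) -> float:
    """Crude error-awareness reward: the fraction of content words from the
    annotator's remark that the model's explanation also mentions."""
    remark_terms = {w for w in re.findall(r"\w+", tqr.lower()) if len(w) > 3}
    if not remark_terms:
        return 0.0
    response_terms = set(re.findall(r"\w+", response.lower()))
    return len(remark_terms & response_terms) / len(remark_terms)


def policy_reward(response: str, ann: QEAnnotation, alpha: float = 0.7) -> float:
    """Convex combination of score accuracy and error-explanation overlap.
    The weighting alpha is a hypothetical value, not one reported in the paper."""
    pred = parse_predicted_score(response)
    return alpha * da_reward(pred, ann.da_score) + (1 - alpha) * tqr_reward(response, ann.tqr)


if __name__ == "__main__":
    ann = QEAnnotation(
        da_score=62.0,
        tqr="The idiom was translated literally, losing its meaning.",
    )
    out = "Score: 58. The translation renders the idiom literally, so the meaning is lost."
    print(f"reward = {policy_reward(out, ann):.3f}")
```

In a policy-gradient setup (e.g., PPO- or GRPO-style training), a scalar reward of this shape would be computed per sampled response and used to update the LoRA adapter weights, which keeps the error-aware signal from the TQRs in the optimization loop alongside the numeric DA target.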