Large Language Models (LLMs) have significantly advanced Machine Translation (MT), enabling its application to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, which undermines the accuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, named MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results demonstrate the efficacy of RATE, which improves the meta-evaluation score by at least 3.2 points over current metrics. Further experiments demonstrate that RATE remains robust on general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.