Large Language Models (LLMs) have achieved remarkable results in the machine translation evaluation task, yet little is known about how they use the provided data to conduct evaluations. This study explores how LLMs leverage source and reference information when evaluating translations, with the ultimate goal of better understanding their working mechanism. To this end, we design controlled experiments across various input modes and model types, and employ both coarse-grained and fine-grained prompts to discern the utility of source versus reference information. Surprisingly, we find that reference information significantly improves evaluation accuracy, while source information is sometimes counterproductive, indicating a lack of cross-lingual capability when LLMs are used to evaluate translations. We further conduct a meta-evaluation of translation error detection by LLMs and observe a similar phenomenon. These findings also suggest a potential research direction: fully exploiting the cross-lingual capability of LLMs to achieve better performance in machine translation evaluation.
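To make the notion of "input modes" concrete, the following is a minimal sketch of how evaluation prompts could be assembled with or without source and reference information. The mode names (`T`, `S-T`, `R-T`, `S-R-T`), the function `build_prompt`, and the prompt wording are illustrative assumptions and are not taken from the study itself.

```python
from typing import Optional

# Hypothetical input modes: (include source?, include reference?).
# The translation under evaluation is always included.
INPUT_MODES = {
    "T": (False, False),     # translation only
    "S-T": (True, False),    # source + translation
    "R-T": (False, True),    # reference + translation
    "S-R-T": (True, True),   # source + reference + translation
}


def build_prompt(
    mode: str,
    translation: str,
    source: Optional[str] = None,
    reference: Optional[str] = None,
) -> str:
    """Assemble a coarse-grained scoring prompt for the given input mode."""
    if mode not in INPUT_MODES:
        raise ValueError(f"unknown input mode: {mode!r}")
    use_source, use_reference = INPUT_MODES[mode]
    if use_source and source is None:
        raise ValueError(f"mode {mode!r} requires a source sentence")
    if use_reference and reference is None:
        raise ValueError(f"mode {mode!r} requires a reference translation")

    # Illustrative instruction text; the actual prompts may differ.
    lines = ["Score the following translation on a scale from 0 to 100."]
    if use_source:
        lines.append(f"Source: {source}")
    if use_reference:
        lines.append(f"Reference: {reference}")
    lines.append(f"Translation: {translation}")
    lines.append("Score:")
    return "\n".join(lines)
```

Holding the translation fixed and varying only the mode lets the contribution of source and reference information be compared directly, for example by correlating the resulting scores with human judgments for each mode.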