Leveraging large language models (LLMs) for various natural language processing tasks has led to superlative claims about their performance. For the evaluation of machine translation (MT), existing research shows that LLMs can achieve results comparable to those of fine-tuned multilingual pre-trained language models. In this paper, we explore what translation information, such as the source, the reference, translation errors, and annotation guidelines, is needed for LLMs to evaluate MT quality. In addition, we investigate prompting techniques, including zero-shot, Chain-of-Thought (CoT), and few-shot prompting, across eight language pairs covering high-, medium-, and low-resource languages, using several LLM variants. Our findings indicate the importance of reference translations for LLM-based evaluation. While larger models do not necessarily fare better, they tend to benefit more from CoT prompting than smaller models do. We also observe that LLMs do not always provide a numerical score when generating evaluations, which raises questions about their reliability for the task. Our work presents a comprehensive analysis of resource-constrained, training-free LLM-based evaluation of machine translation. We publicly release the prompt templates, code, and data for reproducibility.