Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks such as machine translation, question answering, text summarization, and natural language understanding. Recent research has shown that using ChatGPT to assess the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting methods. Our results indicate that a new prompting method, \textbf{\texttt{Error Analysis Prompting}}, which combines Chain-of-Thought prompting with Error Analysis, enables LLMs like ChatGPT to \textit{generate human-like MT evaluations at both the system and segment levels}. Additionally, we identify some limitations of ChatGPT as an MT evaluator, such as unstable scoring and biases when multiple translations are provided in a single query. Our findings offer preliminary guidance for evaluating translation quality with ChatGPT, along with several practical techniques for designing prompts for in-context learning. We anticipate that this report will shed new light on advancing the field of translation evaluation with LLMs by enhancing both the accuracy and reliability of metrics. The project can be found at \url{https://github.com/Coldmist-Lu/ErrorAnalysis_Prompt}.
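To make the idea of combining step-by-step prompting with error analysis more concrete, the following is a minimal sketch and not the authors' implementation. The prompt wording, the function names `build_prompt` and `score_from_counts`, and the MQM-style weights (5 per major error, 1 per minor error) are all assumptions introduced here for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass

# Hypothetical MQM-style weights; the abstract does not state the actual values.
MAJOR_WEIGHT = 5
MINOR_WEIGHT = 1


@dataclass
class ErrorCounts:
    """Number of major and minor errors the LLM reports for one segment."""
    major: int
    minor: int


def build_prompt(source: str, translation: str, reference: str) -> str:
    """Assemble an illustrative prompt that asks for errors first, then counts.

    The two-step structure mirrors the general idea of reasoning step by step
    (identify errors, then count them) instead of asking for a score directly.
    """
    return (
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Translation: {translation}\n"
        "Step 1: List the major errors and the minor errors in the translation.\n"
        "Step 2: Count the major errors and the minor errors.\n"
    )


def score_from_counts(counts: ErrorCounts) -> int:
    """Convert error counts into a segment-level penalty score (0 is best)."""
    return -(MAJOR_WEIGHT * counts.major + MINOR_WEIGHT * counts.minor)


if __name__ == "__main__":
    prompt = build_prompt("Guten Morgen", "Good evening", "Good morning")
    print(prompt)
    # Counts would normally be parsed from the LLM response; hard-coded here.
    print(score_from_counts(ErrorCounts(major=1, minor=0)))
```

Computing the score outside the model from the reported error counts, rather than asking the model to emit a number directly, is one plausible way to reduce the scoring instability mentioned above, although that design choice is also an assumption of this sketch.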