Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. Taking document-level machine translation (MT) as a testbed, this paper provides an in-depth evaluation of LLMs' ability in discourse modeling. The study focuses on three aspects: 1) Effects of Context-Aware Prompts, where we investigate the impact of different prompts on document-level translation quality and discourse phenomena; 2) Comparison of Translation Models, where we compare the translation performance of ChatGPT with commercial MT systems and advanced document-level MT methods; 3) Analysis of Discourse Modeling Abilities, where we further probe the discourse knowledge encoded in LLMs and shed light on the impact of training techniques on discourse modeling. Evaluating on a number of benchmarks, we find, surprisingly, that LLMs deliver superior performance and show potential to become a new paradigm for document-level translation: 1) leveraging their powerful long-text modeling capabilities, GPT-3.5 and GPT-4 outperform commercial MT systems in human evaluation; 2) GPT-4 demonstrates a stronger ability than GPT-3.5 when probed for linguistic knowledge. This work highlights the challenges and opportunities of LLMs for MT, which we hope can inspire the future design and evaluation of LLMs. We release our data and annotations at https://github.com/longyuewangdcu/Document-MT-LLM.