Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. Taking document-level machine translation (MT) as a testbed, this paper provides an in-depth evaluation of LLMs' ability in discourse modeling. The study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where we investigate the impact of different prompts on document-level translation quality and discourse phenomena; 2) Comparison of Translation Models, where we compare the translation performance of ChatGPT with commercial MT systems and advanced document-level MT methods; 3) Analysis of Discourse Modeling Abilities, where we further probe the discourse knowledge encoded in LLMs and examine the impact of training techniques on discourse modeling. Evaluating on a number of benchmarks, we find, surprisingly, that 1) leveraging its powerful long-text modeling capabilities, ChatGPT outperforms commercial MT systems in human evaluation; 2) GPT-4 demonstrates a strong ability to explain discourse knowledge, even though it may select incorrect translation candidates in contrastive testing; 3) ChatGPT and GPT-4 demonstrate superior performance and show potential to become a new and promising paradigm for document-level translation. This work highlights the challenges and opportunities of discourse modeling for LLMs, which we hope can inspire the future design and evaluation of LLMs.