Generative Pre-trained Transformer (GPT) models have shown remarkable capabilities for natural language generation, but their performance for machine translation has not been thoroughly investigated. In this paper, we present a comprehensive evaluation of GPT models for machine translation, covering the quality of different GPT models in comparison with state-of-the-art research and commercial systems, the effect of prompting strategies, robustness to domain shift, and document-level translation. We experiment with eighteen translation directions involving high- and low-resource languages, as well as non-English-centric translations, and evaluate the performance of three GPT models: ChatGPT, GPT3.5 (text-davinci-003), and text-davinci-002. Our results show that GPT models achieve very competitive translation quality for high-resource languages, while having limited capabilities for low-resource languages. We also show that hybrid approaches, which combine GPT models with other translation systems, can further enhance translation quality. We perform comprehensive analysis and human evaluation to further understand the characteristics of GPT translations. We hope that our paper provides valuable insights for researchers and practitioners in the field and helps to better understand the potential and limitations of GPT models for translation.