Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle to the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on the unrealistic assumption of sentence-level alignment. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, consisting of 160 books with intricate discourse structures. We then propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly used machine translation models under this setting. Furthermore, we introduce an approach for finetuning large language models (LLMs) on Ch2Ch literary translation, yielding substantial improvements over baselines. Our comprehensive analysis reveals that literary translation under the Ch2Ch setting is inherently challenging, with respect to both model learning methods and translation decoding algorithms.