Discourse phenomena in existing document-level translation datasets are sparse, which has been a fundamental obstacle to the development of context-aware machine translation models. Moreover, most existing document-level corpora and context-aware machine translation methods rely on the unrealistic assumption of sentence-level alignment. To mitigate these issues, we first curate a novel dataset of Chinese-English literature, consisting of 160 books with intricate discourse structures. We then propose a more pragmatic and challenging setting for context-aware translation, termed chapter-to-chapter (Ch2Ch) translation, and investigate the performance of commonly used machine translation models under this setting. Furthermore, we introduce an approach for finetuning large language models (LLMs) on Ch2Ch literary translation, yielding substantial improvements over baselines. Our comprehensive analysis reveals that literary translation under the Ch2Ch setting is inherently challenging, with respect to both model learning methods and translation decoding algorithms.