Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting

Idioms are common in everyday language, but often pose a challenge to translators because their meanings do not follow from the meanings of their parts. Despite significant advances, machine translation systems still struggle to translate idiomatic expressions. We provide a simple characterization of idiomatic translation and related issues. This allows us to conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations. To expand multilingual resources, we compile a dataset of ~4k natural sentences containing idiomatic expressions in French, Finnish, and Japanese. To improve translation of natural idioms, we introduce two straightforward yet effective techniques: the strategic upweighting of training loss on potentially idiomatic sentences, and using retrieval-augmented models. This not only improves the accuracy of a strong pretrained MT model on idiomatic sentences by up to 13% in absolute accuracy, but also holds potential benefits for non-idiomatic sentences.
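The loss-upweighting idea mentioned above can be illustrated with a minimal sketch. The abstract does not give the weighting scheme, so everything here is an assumption made for illustration: the function name `weighted_sentence_loss`, the `idiom_weight` value of 2.0, and the choice to normalize by the sum of weights. The per-sentence losses are taken as already computed (for example, a mean token-level negative log-likelihood per sentence).

```python
from typing import Sequence


def weighted_sentence_loss(
    sentence_nll: Sequence[float],
    is_potentially_idiomatic: Sequence[bool],
    idiom_weight: float = 2.0,  # hypothetical value, not taken from the paper
) -> float:
    """Return a weighted mean of per-sentence losses.

    Sentences flagged as potentially idiomatic get weight `idiom_weight`;
    all other sentences get weight 1.0. Dividing by the sum of weights
    (an assumption) keeps the loss scale comparable across batches.
    """
    if len(sentence_nll) != len(is_potentially_idiomatic):
        raise ValueError("losses and flags must have the same length")
    if not sentence_nll:
        raise ValueError("batch must contain at least one sentence")

    weights = [idiom_weight if flag else 1.0 for flag in is_potentially_idiomatic]
    weighted_total = sum(w * loss for w, loss in zip(weights, sentence_nll))
    return weighted_total / sum(weights)


if __name__ == "__main__":
    # Two sentences: the second one is flagged as potentially idiomatic.
    losses = [1.0, 3.0]
    flags = [False, True]
    print(weighted_sentence_loss(losses, flags))          # weighted mean
    print(weighted_sentence_loss(losses, flags, 1.0))     # plain mean
```

With `idiom_weight` set to 1.0 the function reduces to the ordinary mean, so the weight acts as a single knob controlling how strongly flagged sentences pull on the training objective.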

