In machine translation, a common problem is that certain words, even when translated, remain incomprehensible to the target-language audience because of differing cultural backgrounds. One solution is to add explanations for these words, which first requires identifying the words or phrases that need them. In this work, we explore techniques for extracting example explanations from a parallel corpus. Because sentences containing words that need explanation are sparse, building a training dataset is extremely difficult. We therefore propose a semi-automatic technique to extract these explanations from a large parallel corpus. Experiments on the English→German language pair show that our method extracts a set of sentences in which more than 10% contain an explanation, compared with only 1.9% of the sentences in the original corpus. Experiments on the English→French and English→Chinese language pairs lead to similar conclusions, showing that the technique is robust across all three language pairs. It is therefore an essential first automatic step toward creating an explanation dataset.