Back Translation (BT) is widely used in machine translation because it has been proven effective at improving translation quality. However, BT mainly improves the translation of inputs that share a similar style (more specifically, translation-like inputs), since the source side of BT data is machine-translated. For natural inputs, BT brings only slight improvements and sometimes even has adverse effects. To address this issue, we propose Text Style Transfer Back Translation (TST BT), which uses a style transfer model to modify the source side of BT data. By making the style of the source-side text more natural, we aim to improve the translation of natural inputs. Our experiments on various language pairs, including both high-resource and low-resource ones, demonstrate that TST BT significantly improves translation performance against popular BT benchmarks. In addition, TST BT proves effective in domain adaptation, so the strategy can be regarded as a general data augmentation method. Our training code and text style transfer model are open-sourced.
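The data flow described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the released training code: the function name `build_tst_bt_corpus` and the two toy stand-ins are hypothetical, and in practice `back_translate` would be a target-to-source translation model and `style_transfer` a text style transfer model that rewrites translation-like text into more natural text.

```python
from typing import Callable, List, Tuple

# A text-to-text function; both real models and toy stand-ins fit this shape.
TextFn = Callable[[str], str]


def build_tst_bt_corpus(
    target_monolingual: List[str],
    back_translate: TextFn,
    style_transfer: TextFn,
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs in the TST BT style.

    Hypothetical helper: each target-side sentence is back-translated into
    the source language, then the synthetic source is passed through a
    style transfer step before being paired with the original target.
    """
    pairs = []
    for target_sentence in target_monolingual:
        # Plain BT: the source side is machine-translated (translation-like).
        synthetic_source = back_translate(target_sentence)
        # TST BT: modify the source side so its style is more natural.
        natural_source = style_transfer(synthetic_source)
        pairs.append((natural_source, target_sentence))
    return pairs


# Toy stand-ins (assumptions for illustration only): they just tag the text
# so the order of the two steps is visible in the output.
def toy_back_translate(sentence: str) -> str:
    return f"[mt] {sentence}"


def toy_style_transfer(sentence: str) -> str:
    return f"[natural] {sentence}"


if __name__ == "__main__":
    corpus = build_tst_bt_corpus(
        ["Guten Tag.", "Wie geht es dir?"],
        toy_back_translate,
        toy_style_transfer,
    )
    for source, target in corpus:
        print(source, "|||", target)
```

The resulting pairs would typically be mixed with genuine parallel data when training the forward translation model, as is usual for BT-based augmentation; only the source side is altered, so the human-written target side stays untouched.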