The ability of machine translation models to translate code-switched content needs to improve, especially given the rise of social media and user-generated content. In this paper, we propose a method for training a single machine translation model that translates monolingual sentences from one language to the other and also translates code-switched sentences into either language. Such a model can be considered bilingual in the human sense. To make better use of parallel data, we generate synthetic code-switched (CSW) data and add an alignment loss on the encoder to align representations across languages. On the WMT14 English-French (En-Fr) dataset, the trained model strongly outperforms bidirectional baselines on code-switched translation while maintaining quality on non-code-switched (monolingual) data.
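To make the idea of synthetic CSW data concrete, the following is a minimal sketch of one way such data could be derived from a parallel sentence pair. It is an illustration under stated assumptions, not necessarily the procedure used in this paper: it assumes that word alignments are already available as (source index, target index) pairs, and the function name `make_code_switched` and the parameter `switch_prob` are hypothetical.

```python
import random


def make_code_switched(src_tokens, tgt_tokens, alignment, switch_prob=0.3, seed=0):
    """Return a copy of src_tokens with some tokens swapped for their aligned
    target-language tokens.

    Assumption (not from the paper): `alignment` is a list of
    (src_index, tgt_index) pairs, e.g. from an external word aligner.
    Only source tokens with exactly one aligned target token are candidates.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible

    # Group the aligned target positions by source position.
    src_to_tgt = {}
    for i, j in alignment:
        src_to_tgt.setdefault(i, []).append(j)

    mixed = []
    for i, tok in enumerate(src_tokens):
        targets = src_to_tgt.get(i, [])
        # Switch languages only for unambiguous one-to-one alignments.
        if len(targets) == 1 and rng.random() < switch_prob:
            mixed.append(tgt_tokens[targets[0]])
        else:
            mixed.append(tok)
    return mixed


en = ["the", "cat", "sleeps"]
fr = ["le", "chat", "dort"]
align = [(0, 0), (1, 1), (2, 2)]

csw = make_code_switched(en, fr, align, switch_prob=0.5, seed=1)
# Each synthetic sentence can be paired with both monolingual sides,
# giving CSW-to-English and CSW-to-French training examples.
training_pairs = [(csw, en), (csw, fr)]
```

In this sketch a single mixed sentence yields two training examples, one per target language, which mirrors the goal of translating code-switched input into either language.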