Neural Machine Translation (NMT) is the task of translating text from one language to another using a trained neural network. Several existing works incorporate external information (e.g. sentiment, politeness, gender) into NMT models to improve or control the predicted translations. In this work, we propose to improve translation quality by adding another external source of information: the emotion automatically recognized in the speaker's voice. This work is motivated by the assumption that each emotion is associated with a specific lexicon, and that these lexicons may overlap across emotions. Our proposed method follows a two-stage procedure. First, we select a state-of-the-art Speech Emotion Recognition (SER) model to predict dimensional emotion values for every audio recording in the dataset. Then, we add these predicted emotions as source tokens at the beginning of the input texts used to train our NMT model. We show that integrating emotion information, especially arousal, into NMT systems leads to better translations.
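The second stage, prepending predicted emotions to the source text, can be illustrated with a minimal sketch. The abstract does not specify how dimensional values are converted into tokens, so everything concrete here is an assumption made for illustration: the `EmotionPrediction` container, values normalized to [0, 1], the split into three equal-width bins, and the `<arousal_2>`-style token names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EmotionPrediction:
    """Hypothetical container for the dimensional output of an SER model.

    Values are assumed to be normalized to [0, 1].
    """
    arousal: float
    valence: float
    dominance: float


def bucket(value: float, n_bins: int = 3) -> int:
    """Map a value in [0, 1] to a bin index in [0, n_bins - 1] (assumed scheme)."""
    clipped = min(max(value, 0.0), 1.0)
    # int() floors non-negative values; min() keeps 1.0 inside the last bin.
    return min(int(clipped * n_bins), n_bins - 1)


def emotion_tokens(pred: EmotionPrediction, dims=("arousal",), n_bins: int = 3) -> list:
    """Build one source-side tag per selected emotion dimension."""
    return [f"<{dim}_{bucket(getattr(pred, dim), n_bins)}>" for dim in dims]


def tag_source(text: str, pred: EmotionPrediction, dims=("arousal",), n_bins: int = 3) -> str:
    """Prepend the emotion tags to the source sentence before NMT training."""
    return " ".join(emotion_tokens(pred, dims, n_bins) + [text])


if __name__ == "__main__":
    pred = EmotionPrediction(arousal=0.9, valence=0.2, dominance=0.5)
    print(tag_source("I can't believe it!", pred))
    print(tag_source("I can't believe it!", pred, dims=("arousal", "valence")))
```

In a setup like this, the tags would typically be registered as special tokens in the NMT vocabulary so that the tokenizer does not split them into subwords. Whether one dimension or several are used is controlled here by `dims`, which defaults to arousal only.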