Existing multimodal machine translation (MMT) datasets consist of image and video captions or instructional video subtitles, which rarely contain linguistic ambiguity, so visual information contributes little to generating appropriate translations. Recent work has constructed a dataset of ambiguous subtitles to alleviate this problem, but it remains limited in that the videos do not necessarily contribute to disambiguation. We introduce EVA (Extensive training set and Video-helpful evaluation set for Ambiguous subtitles translation), an MMT dataset containing 852k Japanese-English (Ja-En) parallel subtitle pairs, 520k Chinese-English (Zh-En) parallel subtitle pairs, and corresponding video clips collected from movies and TV episodes. In addition to the extensive training set, EVA contains a video-helpful evaluation set in which subtitles are ambiguous and videos are guaranteed to be helpful for disambiguation. Furthermore, we propose SAFA, an MMT model based on the Selective Attention model with two novel methods, Frame attention loss and Ambiguity augmentation, which aim to make full use of the videos in EVA for disambiguation. Experiments on EVA show that visual information and the proposed methods boost translation performance, and that our model performs significantly better than existing MMT models. The EVA dataset and the SAFA model are available at: https://github.com/ku-nlp/video-helpful-MMT.git.