Multimodal machine translation (MMT) is a challenging task that seeks to improve translation quality by incorporating visual information. However, recent studies have indicated that the visual information provided by existing MMT datasets is insufficient, so models learn to disregard it and evaluations overestimate their multimodal capabilities. This issue is a significant obstacle to the development of MMT research. To address it, this paper introduces 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel English-Chinese sentence pairs, each with corresponding images. The dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets. We use a word sense disambiguation model to select ambiguous data from vision-and-language datasets, which yields a more challenging dataset. We further benchmark several state-of-the-art MMT models on the proposed dataset. Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets. Our work provides a valuable resource for researchers in the field of multimodal learning and encourages further exploration in this area. The data, code and scripts are freely available at https://github.com/MaxyLee/3AM.