This study investigates how effective multimodal information is in Neural Machine Translation (NMT). Whereas prior research focused on multimodal data in low-resource scenarios, this study examines how image features affect translation when they are added to a large-scale, pre-trained unimodal NMT system. Surprisingly, the study finds that images may be redundant in this setting. The research also introduces synthetic noise to assess whether images help the model cope with textual noise. In noisy settings, multimodal models slightly outperform text-only models, even when the images are random. The experiments translate from English to Hindi, Bengali, and Malayalam, and significantly outperform state-of-the-art benchmarks. Interestingly, the effect of visual context varies with the level of source-text noise: no visual context works best for non-noisy translation, cropped image features are optimal at low noise, and full image features work better at high noise. These results shed light on the role of visual context, especially in noisy settings, and open up a new research direction for noisy NMT in multimodal setups. The research emphasizes the importance of combining visual and textual information to improve translation across varied environments.
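The summary does not say how the synthetic noise is generated. As an illustration only, the sketch below assumes a simple character-level scheme in which each source token has its adjacent characters swapped with some probability. The function name `add_synthetic_noise` and the parameter `noise_rate` are hypothetical and are not taken from the study; the study's actual noise procedure may differ.

```python
import random
from typing import Optional


def add_synthetic_noise(sentence: str, noise_rate: float,
                        seed: Optional[int] = None) -> str:
    """Corrupt a source sentence by swapping adjacent characters.

    ASSUMPTION: this noise scheme is illustrative; the study's own
    procedure is not described in the summary.

    Each whitespace-separated token is selected with probability
    `noise_rate`. A selected token of length >= 2 has one randomly
    chosen pair of adjacent characters swapped.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy_tokens = []
    for token in sentence.split():
        if len(token) > 1 and rng.random() < noise_rate:
            i = rng.randrange(len(token) - 1)
            chars = list(token)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            token = "".join(chars)
        noisy_tokens.append(token)
    return " ".join(noisy_tokens)


if __name__ == "__main__":
    source = "a man is riding a bicycle on the street"
    # Hypothetical "low" and "high" noise levels for comparison.
    for rate in (0.0, 0.2, 0.6):
        print(rate, "->", add_synthetic_noise(source, rate, seed=0))
```

Under this assumption, varying `noise_rate` gives the clean, low-noise, and high-noise conditions against which text-only, cropped-image, and full-image models could be compared.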