Current multimodal machine translation (MMT) systems rely on fully supervised data (i.e., models are trained on sentences with their translations and accompanying images). However, this type of data is costly to collect, limiting the extension of MMT to other language pairs for which such data does not exist. In this work, we propose a method to bypass the need for fully supervised data to train MMT systems, using multimodal English data only. Our method, called ZeroMMT, consists of adapting a strong text-only machine translation (MT) model by training it on a mixture of two objectives: visually conditioned masked language modelling and the Kullback-Leibler divergence between the original MT and new MMT outputs. We evaluate on standard MMT benchmarks and on the recently released CoMMuTE, a contrastive benchmark that measures how well models use images to disambiguate English sentences. We obtain disambiguation performance close to that of state-of-the-art MMT models trained additionally on fully supervised examples. To prove that our method generalizes to languages with no fully supervised training data available, we extend the CoMMuTE evaluation dataset to three new languages: Arabic, Russian and Chinese. We further show that we can control the trade-off between disambiguation capability and translation fidelity at inference time using classifier-free guidance, without any additional data. Our code, data and trained models are publicly accessible.
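The two training objectives and the inference-time guidance described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the weighting scalar `alpha`, and the direction of the KL term are assumptions, and real models would produce these logits from an encoder-decoder with visual features.

```python
import torch
import torch.nn.functional as F

def zerommt_style_loss(mmt_logits, frozen_mt_logits, mlm_logits, mlm_targets,
                       alpha=1.0):
    """Sketch of the two-objective mixture: visually conditioned masked LM
    plus a KL term keeping the adapted MMT close to the original MT model.
    `alpha` is a hypothetical mixing weight, not from the paper."""
    # KL divergence between the adapted MMT distribution and the frozen
    # text-only MT distribution (direction here is an assumption)
    kl = F.kl_div(
        F.log_softmax(mmt_logits, dim=-1),
        F.log_softmax(frozen_mt_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    # visually conditioned masked language modelling on English multimodal
    # data; positions without a masked target use ignore_index=-100
    mlm = F.cross_entropy(
        mlm_logits.view(-1, mlm_logits.size(-1)),
        mlm_targets.view(-1),
        ignore_index=-100,
    )
    return mlm + alpha * kl

def cfg_logits(text_only_logits, multimodal_logits, guidance=1.0):
    """Classifier-free guidance at inference: move the prediction away from
    the text-only distribution toward the image-conditioned one. Larger
    `guidance` favours disambiguation over fidelity to the original MT."""
    return text_only_logits + guidance * (multimodal_logits - text_only_logits)
```

With `guidance=0` decoding reduces to the text-only MT model; `guidance=1` uses the multimodal distribution unchanged; intermediate or larger values trade translation fidelity against disambiguation, which is how the trade-off can be controlled without further training.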