Multimodal machine translation (MMT) takes both the source sentence and a relevant image as input for translation. Since in most cases no paired image is available for the input sentence, recent studies suggest using powerful text-to-image generation models to provide the image input. However, synthetic images generated by these models often follow a different distribution from authentic images. Consequently, training on authentic images and running inference on synthetic images can introduce a distribution shift, which degrades performance at inference time. To tackle this challenge, we feed synthetic and authentic images to the MMT model separately during training. We then minimize the gap between the two by drawing together the input image representations of the Transformer Encoder and the output distributions of the Transformer Decoder. In this way, we mitigate the distribution disparity introduced by synthetic images during inference, so that authentic images are no longer needed in the inference process. Experimental results show that our approach achieves state-of-the-art performance on the Multi30K En-De and En-Fr datasets while remaining independent of authentic images during inference.
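The training objective described in the abstract can be illustrated with a small numerical sketch. The abstract does not say which distance measures are used, so everything below is an assumption made for illustration: mean squared error stands in for the gap between image representations, KL divergence stands in for the gap between decoder output distributions, and the function names and the weights `alpha` and `beta` are hypothetical.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_entropy(logits, targets, eps=1e-12):
    """Mean negative log-likelihood of the reference tokens."""
    probs = softmax(logits)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked + eps).mean())


def representation_gap(synth_repr, auth_repr):
    """Gap between encoder-side image representations.

    Assumption: mean squared error is used here only as a stand-in measure.
    """
    return float(np.mean((synth_repr - auth_repr) ** 2))


def output_gap(synth_logits, auth_logits, eps=1e-12):
    """Gap between decoder output distributions.

    Assumption: KL(authentic || synthetic), averaged over target positions.
    """
    p = softmax(auth_logits)
    q = softmax(synth_logits)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())


def training_loss(synth_repr, auth_repr, synth_logits, auth_logits,
                  targets, alpha=1.0, beta=1.0):
    """Translation loss on both image passes plus the two consistency terms.

    alpha and beta are hypothetical weights, not values from the paper.
    """
    translation = (cross_entropy(synth_logits, targets)
                   + cross_entropy(auth_logits, targets))
    return (translation
            + alpha * representation_gap(synth_repr, auth_repr)
            + beta * output_gap(synth_logits, auth_logits))
```

In this sketch, both consistency terms vanish when the synthetic and authentic passes produce identical representations and logits, leaving only the translation loss. At inference time only the synthetic pass would be run, which matches the abstract's claim that authentic images are not needed for inference.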