Prior work on multimodal machine translation (MMT) improves on the bilingual setup by incorporating additional aligned visual information. However, the image-must requirement of multimodal datasets, namely that every example take the aligned form [image, source text, target text], largely hinders MMT's development. This limitation is especially troublesome at inference time, when no aligned image is provided, as in the standard NMT setup. In this work, we therefore introduce IKD-MMT, a novel MMT framework that supports image-free inference via an inversion knowledge distillation scheme. In particular, a multimodal feature generator is coupled with a knowledge distillation module and generates the multimodal feature directly from the source text alone. While a few prior works have explored image-free inference for machine translation, their performance has yet to rival that of image-must translation. In our experiments, our method is the first image-free approach to comprehensively rival or even surpass (almost) all image-must frameworks, and it achieves state-of-the-art results on the widely used Multi30k benchmark. Our code and data are available at: https://github.com/pengr/IKD-mmt/tree/master.
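To make the idea of generating a visual-side feature from text alone more concrete, the following is a minimal sketch of feature-level knowledge distillation. It is not the authors' implementation: the linear generator, the synthetic "teacher" features, and every name in it (`generate`, `train_generator`, `make_toy_data`) are assumptions chosen for illustration. A student generator is fit so that its output, computed from a text feature only, matches a teacher feature that would normally come from an image encoder; at inference the teacher and the image are no longer needed.

```python
import random


def generate(W, text_feat):
    """Hypothetical student generator: a linear map from text to a pseudo-visual feature."""
    return [sum(w * x for w, x in zip(row, text_feat)) for row in W]


def mse(pred, target):
    """Mean squared error, used here as the distillation loss (an assumption)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def train_generator(pairs, d_text, d_vis, lr=0.05, epochs=200, seed=0):
    """Fit W by SGD so that generate(W, text) approximates the teacher feature."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(d_text)] for _ in range(d_vis)]
    for _ in range(epochs):
        for text_feat, teacher_feat in pairs:
            pred = generate(W, text_feat)
            for i in range(d_vis):
                # d(MSE)/d(W[i][j]) = 2 * (pred_i - teacher_i) * x_j / d_vis
                g = 2.0 * (pred[i] - teacher_feat[i]) / d_vis
                for j in range(d_text):
                    W[i][j] -= lr * g * text_feat[j]
    return W


def make_toy_data(n=50, d_text=4, d_vis=3, seed=1):
    """Synthetic (text feature, teacher visual feature) pairs standing in for real encoders."""
    rng = random.Random(seed)
    true_map = [[rng.uniform(-1, 1) for _ in range(d_text)] for _ in range(d_vis)]
    pairs = []
    for _ in range(n):
        text_feat = [rng.uniform(-1, 1) for _ in range(d_text)]
        pairs.append((text_feat, generate(true_map, text_feat)))
    return pairs


pairs = make_toy_data()
W = train_generator(pairs, d_text=4, d_vis=3)

# Image-free "inference": only the text feature is needed to obtain the feature.
avg_loss = sum(mse(generate(W, t), v) for t, v in pairs) / len(pairs)
print("average distillation loss:", avg_loss)
```

In a real system the linear map would be replaced by a learned network and the teacher features by the outputs of a pretrained image encoder; the toy version only shows the training signal and the fact that inference takes text as its sole input.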