Visual information has been introduced to enhance machine translation (MT), but its effectiveness heavily relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations. In this paper, we introduce a stable diffusion-based imagination network into a multimodal large language model (MLLM) to explicitly generate an image for each source sentence, thereby advancing multimodal MT. In particular, we incorporate heuristic human feedback into reinforcement learning to ensure that the generated image is consistent with the source sentence without requiring supervised image annotations, which breaks the bottleneck of using visual information in MT. Furthermore, the proposed method enables imaginative visual information to be integrated into large-scale text-only MT in addition to multimodal MT. Experimental results show that our model significantly outperforms existing multimodal MT and text-only MT models, notably achieving an average improvement of more than 14 BLEU points on the Multi30K multimodal MT benchmarks.
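To make the idea concrete, below is a minimal sketch of the imagination-plus-consistency-reward pipeline described above, assuming the Hugging Face diffusers and transformers libraries. The checkpoints, the CLIP-based similarity reward, and the helper names (imagine, consistency_reward) are illustrative assumptions, not the paper's actual implementation; in practice the reward would drive reinforcement-learning updates of the imagination network, and the (sentence, imagined image) pair would be fed to the MLLM for translation.

```python
# Hypothetical sketch: imagine an image for a source sentence with Stable Diffusion,
# then score sentence-image consistency with CLIP as a heuristic reward signal.
# All model checkpoints and helper names here are illustrative assumptions.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Imagination network: a text-to-image diffusion model (stand-in checkpoint).
sd_pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

# Consistency scorer: CLIP image-text similarity used as the heuristic reward.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def imagine(source_sentence: str):
    """Generate one image that 'imagines' the content of the source sentence."""
    return sd_pipe(source_sentence, num_inference_steps=30).images[0]


def consistency_reward(source_sentence: str, image) -> float:
    """Heuristic reward: CLIP similarity between the source sentence and the image.

    No gold image annotation is needed; a higher score means the imagined
    image is more faithful to the source sentence.
    """
    inputs = clip_proc(
        text=[source_sentence], images=image, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        logits_per_image = clip_model(**inputs).logits_per_image  # shape (1, 1)
    return logits_per_image.item()


# Usage: compute the reward for one imagined image; in an RL setup this value
# would be used as feedback to update the imagination network.
source = "A man in a red shirt is riding a bicycle down the street."
image = imagine(source)
print(f"consistency reward: {consistency_reward(source, image):.2f}")
```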