Multimodal learning develops models that integrate information from multiple sources, such as images and text. Within this field, multimodal text generation is a key problem: the model processes data from several modalities and outputs text. Image-guided story ending generation (IgSEG) is a particularly significant task, since it requires understanding the complex relationships between text and image data in order to generate a complete story ending. Unfortunately, deep neural networks, which are the backbone of recent IgSEG models, are vulnerable to adversarial samples. Current adversarial attack methods mainly focus on single-modality data and do not analyze adversarial attacks on multimodal text generation tasks that rely on cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image-modality and text-modality attacks, searching for adversarial text and adversarial images in a more effective, iterative manner. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating its potential for improving the adversarial robustness of multimodal text generation models, for example in multimodal machine translation and multimodal question answering.
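To make the idea of alternating image and text attacks concrete, the following is a minimal sketch on a toy model. It is not the Iterative-attack algorithm itself, whose details the abstract does not give. Everything here is an assumption made for illustration: the linear `score` victim, the sign-gradient image step with an L-infinity budget `eps`, the greedy single-token substitution with budget `max_subs`, and all function and parameter names.

```python
import numpy as np

def score(image, tokens, emb, w):
    """Toy victim (assumed): higher score = more confident in the original ending."""
    text_vec = emb[tokens].mean(axis=0)      # average token embedding
    return float(w @ (image * text_vec))     # element-wise cross-modal fusion

def image_step(adv_image, clean_image, tokens, emb, w, eps, alpha):
    """One signed-gradient step on the image, projected back into the eps-ball."""
    grad = w * emb[tokens].mean(axis=0)      # exact gradient of the toy score
    stepped = adv_image - alpha * np.sign(grad)
    return np.clip(stepped, clean_image - eps, clean_image + eps)

def text_step(adv_image, adv_tokens, clean_tokens, emb, w, max_subs):
    """Greedy search for the single token swap that lowers the score most."""
    best_tokens = adv_tokens
    best_score = score(adv_image, adv_tokens, emb, w)
    for pos in range(len(adv_tokens)):
        for cand in range(emb.shape[0]):
            trial = adv_tokens.copy()
            trial[pos] = cand
            # Respect the budget: at most max_subs tokens may differ from the clean text.
            if np.sum(trial != clean_tokens) > max_subs:
                continue
            s = score(adv_image, trial, emb, w)
            if s < best_score:
                best_tokens, best_score = trial, s
    return best_tokens

def iterative_attack(image, tokens, emb, w, eps=0.1, alpha=0.02, max_subs=1, rounds=5):
    """Alternate image and text steps so each modality sees the other's latest perturbation."""
    adv_image, adv_tokens = image.copy(), tokens.copy()
    for _ in range(rounds):
        adv_image = image_step(adv_image, image, adv_tokens, emb, w, eps, alpha)
        adv_tokens = text_step(adv_image, adv_tokens, tokens, emb, w, max_subs)
    return adv_image, adv_tokens

# Hypothetical toy data: 20-word vocabulary, 8-dimensional features.
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
w = rng.normal(size=8)
image = rng.uniform(0.0, 1.0, size=8)
tokens = rng.integers(0, 20, size=5)

adv_image, adv_tokens = iterative_attack(image, tokens, emb, w)
print("clean score:", score(image, tokens, emb, w))
print("adversarial score:", score(adv_image, adv_tokens, emb, w))
```

In this sketch, a non-iterative multimodal baseline would correspond to running the image attack and the text attack once each against the clean input of the other modality. The loop instead lets each step react to the perturbation already applied to the other modality, which is the intuition the abstract points to.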