With the increasing influence of social media, online misinformation has become a societal issue. Our work is motivated by the threat posed by cheapfakes, in which an unaltered image is paired with a news caption that places it in a new but false context. The main challenge in detecting such out-of-context multimedia is the lack of large-scale datasets. Several detection methods pair images with randomly selected captions to generate out-of-context training inputs. However, these randomly matched captions are not truly representative of out-of-context scenarios, because the matched caption is often inconsistent with what the image depicts. We address these limitations by introducing a novel task, out-of-context caption generation, and propose a new method that generates a realistic out-of-context caption given visual and textual context. We also demonstrate that the semantics of the generated captions can be controlled through the textual context. Evaluated against several baselines, our method improves over the image captioning baseline by 6.2% BLEU-4, 2.96% CIDEr, 11.5% ROUGE, and 7.3% METEOR.