Visual entailment (VE) is a multimodal reasoning task over image-sentence pairs, in which the premise is given by an image and the hypothesis by a sentence. The goal is to predict whether the image semantically entails the sentence. VE systems have been widely adopted in many downstream tasks. Metamorphic testing is the most common testing technique for AI algorithms, but applying it to VE poses a significant challenge. Existing approaches either perturb only a single modality, which yields ineffective tests because the relationship within the image-text pair is destroyed, or apply only shallow perturbations to the inputs, which can hardly expose the decision errors made by VE systems. Motivated by the fact that the objects in the image are the fundamental elements for reasoning, we propose VEglue, an object-aligned joint erasing approach for testing VE systems. It first aligns the object regions in the premise with the object descriptions in the hypothesis to identify linked and un-linked objects. Then, based on the alignment information, three metamorphic relations are designed to jointly erase the objects in the two modalities. We evaluate VEglue on four widely used VE systems and two public datasets. Results show that VEglue detects 11,609 issues on average, which is 194%-2,846% more than the baselines. In addition, VEglue reaches an Issue Finding Rate (IFR) of 52.5% on average, significantly outperforming the baselines by 17.1%-38.2%. Furthermore, we use the tests generated by VEglue to retrain the VE systems, which largely improves model performance on newly generated tests (a 50.8% increase in accuracy) without sacrificing accuracy on the original test set.
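To make the testing loop concrete, here is a minimal Python sketch of how a metamorphic test for a VE system might be checked and how an Issue Finding Rate could be computed. It is an illustration under stated assumptions, not VEglue's implementation: the abstract does not define IFR or the three metamorphic relations, so the sketch assumes IFR is the fraction of generated tests whose outputs violate their relation. The names `VEInput`, `MetamorphicTest`, `same_label`, `toy_model`, and `issue_finding_rate` are all hypothetical, and the "erase an object not mentioned in the hypothesis, expect the same label" relation is only an example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable

Label = str  # e.g. "entailment" or "neutral"


@dataclass(frozen=True)
class VEInput:
    # Assumption: the image is abstracted as the set of object names it contains.
    image: FrozenSet[str]
    hypothesis: str


@dataclass(frozen=True)
class MetamorphicTest:
    source: VEInput
    follow_up: VEInput
    # Returns True when the two predictions satisfy the metamorphic relation.
    relation: Callable[[Label, Label], bool]


def same_label(source_pred: Label, follow_up_pred: Label) -> bool:
    # Illustrative relation only: the prediction should not change.
    return source_pred == follow_up_pred


def issue_finding_rate(model: Callable[[VEInput], Label],
                       tests: Iterable[MetamorphicTest]) -> float:
    # Assumed definition: violated tests divided by all generated tests.
    tests = list(tests)
    if not tests:
        return 0.0
    issues = sum(
        1 for t in tests
        if not t.relation(model(t.source), model(t.follow_up))
    )
    return issues / len(tests)


def toy_model(x: VEInput) -> Label:
    # Deliberately brittle stand-in for a VE system: it needs at least two
    # objects in the image before it will ever predict entailment.
    mentioned = any(obj in x.hypothesis for obj in x.image)
    return "entailment" if mentioned and len(x.image) >= 2 else "neutral"


hypothesis = "a dog runs"
tests = [
    # Erase "ball" (not mentioned in the hypothesis); two objects remain.
    MetamorphicTest(
        VEInput(frozenset({"dog", "ball", "tree"}), hypothesis),
        VEInput(frozenset({"dog", "tree"}), hypothesis),
        same_label,
    ),
    # Erase "ball" again; only one object remains, which trips the toy model.
    MetamorphicTest(
        VEInput(frozenset({"dog", "ball"}), hypothesis),
        VEInput(frozenset({"dog"}), hypothesis),
        same_label,
    ),
]

ifr = issue_finding_rate(toy_model, tests)
```

In this toy setup the second test violates its relation while the first does not, so one of the two tests counts as an issue. A real harness would replace `toy_model` with calls to an actual VE system and generate the follow-up inputs by erasing objects in both the image and the sentence.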