Adversarial Testing for Visual Grounding via Image-Aware Property Reduction

Because fusing information from multiple modalities is advantageous, multimodal learning is attracting increasing attention. Visual Grounding (VG), a fundamental task in multimodal learning, aims to locate objects in images through natural language expressions. Ensuring the quality of VG models is challenging because of the complex nature of the task. In the black-box scenario, existing adversarial testing techniques often fail to fully exploit the information in both modalities. They typically apply perturbations based solely on either the image or the text, disregarding the crucial correlation between the two modalities, which leads to test oracle failures or to tests that cannot effectively challenge VG models. To this end, we propose PEELING, a text perturbation approach based on image-aware property reduction for adversarial testing of VG models. The core idea is to reduce the property-related information in the original expression while ensuring that the reduced expression still uniquely describes the original object in the image. To achieve this, PEELING first extracts the object and its properties from the expression and recombines them to generate candidate property-reduced expressions. It then queries the image with a visual understanding technique and selects the candidates that accurately describe the original object while no other object in the image satisfies them. We evaluate PEELING on the state-of-the-art VG model OFA-VG using three commonly used datasets. Results show that the adversarial tests generated by PEELING achieve a MultiModal Impact (MMI) score of 21.4% and outperform state-of-the-art baselines for images and texts by 8.2% to 15.1%.
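The two-stage idea described above (generate property-reduced candidates, then keep only those that still uniquely identify the target) can be illustrated with a minimal sketch. It is not the PEELING implementation: it assumes the expression has already been split into an object name and a list of properties, and it replaces the visual understanding query with a hypothetical `toy_oracle` over a hand-written scene. The names `candidate_reductions`, `peel`, `SCENE`, and `TARGET` are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

# Hypothetical oracle signature (an assumption, standing in for the visual
# understanding query): given an object name and a tuple of properties, return
# (does the target satisfy them?, how many objects in the image satisfy them?).
Oracle = Callable[[str, Tuple[str, ...]], Tuple[bool, int]]


def candidate_reductions(properties: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every proper subset of the properties, so at least one is dropped."""
    return [c for k in range(len(properties)) for c in combinations(properties, k)]


def peel(obj: str, properties: Sequence[str], oracle: Oracle) -> List[str]:
    """Keep reduced expressions that describe the target and no other object."""
    kept = []
    for props in candidate_reductions(properties):
        target_ok, n_matches = oracle(obj, props)
        if target_ok and n_matches == 1:
            kept.append(" ".join([*props, obj]))
    return kept


# Toy scene used only for illustration; a real setup would query the image.
SCENE = [
    {"name": "cup", "attrs": {"small", "red"}},   # target object
    {"name": "cup", "attrs": {"small", "blue"}},
    {"name": "plate", "attrs": {"red"}},
]
TARGET = 0


def toy_oracle(obj: str, props: Tuple[str, ...]) -> Tuple[bool, int]:
    """Match by name and attribute subset against the hand-written scene."""
    matches = [
        i for i, o in enumerate(SCENE)
        if o["name"] == obj and set(props) <= o["attrs"]
    ]
    return TARGET in matches, len(matches)


# Original expression: "small red cup".
reduced = peel("cup", ["small", "red"], toy_oracle)
print(reduced)
```

In this toy scene, "cup" and "small cup" are rejected because both cups satisfy them, while "red cup" is kept because only the target does. This mirrors the uniqueness check that keeps the original bounding box usable as the test oracle.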
