Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets that change object characteristics (viewpoint, scale, color) or apply image transformations (adversarial perturbations, common corruptions) to real images to simulate distribution shifts. Recent works have explored using large language models and diffusion models to generate changes in the background. However, these methods either offer little control over the changes being made or distort the object semantics, making them unsuitable for the task. Our method, in contrast, induces diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. This allows us to quantify the role of background context in the robustness and generalization of deep neural networks. We produce several versions of standard vision datasets (ImageNet, COCO), either incorporating diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks.
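The core constraint described above, changing the background while leaving the object pixels untouched, can be illustrated with a minimal sketch. This is not the diffusion-based pipeline of the paper: it swaps in a plain mask-based paste as a stand-in for the generative step, and it assumes an object mask and a replacement background are already available. The names `swap_background`, `accuracy`, `background_accuracy_drop`, and the `predict` callable are hypothetical and introduced only for illustration.

```python
import numpy as np


def swap_background(image, object_mask, new_background):
    """Keep object pixels from `image`; take all other pixels from `new_background`.

    image, new_background: arrays of shape (H, W, C)
    object_mask: array of shape (H, W), nonzero where the object is
    """
    # Add a channel axis so the (H, W) mask broadcasts over the C channels.
    mask = np.asarray(object_mask, dtype=bool)[..., None]
    return np.where(mask, image, new_background)


def accuracy(predict, images, labels):
    """Fraction of images for which the hypothetical `predict` returns the label."""
    return float(np.mean([predict(img) == y for img, y in zip(images, labels)]))


def background_accuracy_drop(predict, images, masks, backgrounds, labels):
    """Accuracy on the original images minus accuracy after the background swap."""
    edited = [
        swap_background(img, m, bg)
        for img, m, bg in zip(images, masks, backgrounds)
    ]
    return accuracy(predict, images, labels) - accuracy(predict, edited, labels)
```

A positive value from `background_accuracy_drop` indicates that the model's predictions depend on background context, since the object pixels are identical in both versions of each image.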