Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets that alter object characteristics (viewpoint, scale, color) or apply image transformation techniques (adversarial changes, common corruptions) to real images to simulate distribution shifts. Recent works have explored large language models and diffusion models for generating background changes. However, these methods either offer little control over the changes being made or distort the object semantics, making them unsuitable for this task. Our method, in contrast, induces diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. This allows us to quantify the role of background context in the robustness and generalization of deep neural networks. We produce several versions of standard vision datasets (ImageNet, COCO), either incorporating diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks.
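The idea of changing the background while leaving the object untouched, and then measuring the effect on a model, can be illustrated with a minimal sketch. This is not the pipeline described above, which relies on text-to-image, image-to-text, and image-to-segment models; it only assumes that an object mask and a replacement background are already available as arrays. The function names `swap_background` and `accuracy_drop` are hypothetical and chosen for illustration.

```python
import numpy as np


def swap_background(image, object_mask, new_background):
    """Keep object pixels from `image`; take every other pixel from `new_background`.

    Assumption for this sketch: `image` and `new_background` are (H, W, C) arrays
    of the same shape, and `object_mask` is an (H, W) array that is nonzero on
    the object (for example, the output of a segmentation model).
    """
    if image.shape != new_background.shape:
        raise ValueError("image and new_background must have the same shape")
    if object_mask.shape != image.shape[:2]:
        raise ValueError("object_mask must match the image height and width")
    # Add a trailing axis so the (H, W, 1) mask broadcasts over the channels.
    mask = object_mask.astype(bool)[..., None]
    return np.where(mask, image, new_background)


def accuracy_drop(labels, preds_original, preds_modified):
    """Top-1 accuracy on original images minus accuracy on modified images."""
    labels = np.asarray(labels)
    acc_original = float(np.mean(np.asarray(preds_original) == labels))
    acc_modified = float(np.mean(np.asarray(preds_modified) == labels))
    return acc_original - acc_modified
```

A larger value of `accuracy_drop` would suggest that a classifier leans more heavily on background context. In practice the predictions would come from running the model under evaluation on both versions of the dataset.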