ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for their real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets to induce changes in object characteristics (viewpoint, scale, color) or apply image transformation techniques (adversarial changes, common corruptions) to real images to simulate distribution shifts. Recent works have explored leveraging large language models and diffusion models to generate changes in the background. However, these methods either offer little control over the changes to be made or distort the object semantics, making them unsuitable for the task. Our method, on the other hand, can induce diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this goal, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes by either modifying the textual prompts or optimizing the latents and textual embedding of text-to-image models. This allows us to quantify the role of background context in understanding the robustness and generalization of deep neural networks. We produce various versions of standard vision datasets (ImageNet, COCO), incorporating either diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks.
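The core constraint described above, changing the background while leaving the object untouched, can be illustrated with a minimal sketch. This is not the paper's pipeline, which relies on text-to-image, image-to-text, and image-to-segment models; it only assumes that an object mask and a replacement background are already available as NumPy arrays. The function names `composite_background` and `accuracy_drop` are hypothetical and chosen for illustration.

```python
import numpy as np


def composite_background(image, object_mask, new_background):
    """Keep object pixels from `image`; take every other pixel from `new_background`.

    Assumed inputs (hypothetical, for illustration only):
      image, new_background: arrays of shape (H, W, C) with the same dtype
      object_mask: array of shape (H, W), nonzero where the object is
    """
    if image.shape != new_background.shape:
        raise ValueError("image and new_background must have the same shape")
    if object_mask.shape != image.shape[:2]:
        raise ValueError("object_mask must match the image height and width")
    # Add a trailing axis so the (H, W, 1) mask broadcasts over the channels.
    mask = object_mask.astype(bool)[..., None]
    return np.where(mask, image, new_background)


def accuracy_drop(labels, preds_original, preds_modified):
    """Accuracy on original images minus accuracy on background-modified images."""
    labels = np.asarray(labels)
    acc_original = float(np.mean(np.asarray(preds_original) == labels))
    acc_modified = float(np.mean(np.asarray(preds_modified) == labels))
    return acc_original - acc_modified


if __name__ == "__main__":
    image = np.full((4, 4, 3), 200, dtype=np.uint8)   # toy "photo"
    mask = np.zeros((4, 4), dtype=bool)
    mask[1:3, 1:3] = True                             # toy object region
    background = np.zeros((4, 4, 3), dtype=np.uint8)  # toy replacement background
    edited = composite_background(image, mask, background)
    print(edited[:, :, 0])
    print(accuracy_drop([0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 1]))
```

Hard compositing like this guarantees that object pixels are unchanged, which is the property the abstract emphasizes, and the accuracy gap between original and modified images is one simple way to express how much a model depends on background context.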

