ObjectCompose: Evaluating Resilience of Vision-Based Models on Object-to-Background Compositional Changes

Given the large-scale multi-modal training of recent vision-based models and their generalization capabilities, understanding the extent of their robustness is critical for real-world deployment. In this work, we evaluate the resilience of current vision-based models against diverse object-to-background context variations. Most robustness evaluation methods either introduce synthetic datasets that change object characteristics (viewpoint, scale, color) or apply image transformations (adversarial changes, common corruptions) to real images to simulate distribution shifts. Recent works have explored using large language models and diffusion models to generate background changes. However, these methods either offer little control over the changes being made or distort the object semantics, making them unsuitable for the task. Our method, in contrast, induces diverse object-to-background changes while preserving the original semantics and appearance of the object. To achieve this, we harness the generative capabilities of text-to-image, image-to-text, and image-to-segment models to automatically generate a broad spectrum of object-to-background changes. We induce both natural and adversarial background changes, either by modifying the textual prompts or by optimizing the latents and textual embeddings of text-to-image models. This allows us to quantify the role of background context in the robustness and generalization of deep neural networks. We produce several versions of standard vision datasets (ImageNet, COCO), either incorporating diverse and realistic backgrounds into the images or introducing color, texture, and adversarial changes in the background. We conduct extensive experiments to analyze the robustness of vision-based models against object-to-background context variations across diverse tasks.
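The central constraint described above, changing the background while leaving the object untouched, can be illustrated with a minimal sketch. This is not the paper's pipeline: the abstract describes text-to-image generation guided by a segmentation mask, whereas the snippet below swaps in plain mask-based pixel compositing with NumPy to show the idea. The function names `compose_with_background` and `accuracy_drop`, the toy array shapes, and the solid-color background are all assumptions made for illustration.

```python
import numpy as np


def compose_with_background(image, object_mask, new_background):
    """Keep object pixels from `image`; take all other pixels from `new_background`.

    Hypothetical helper: a stand-in for mask-guided generation, not the authors' method.
    image, new_background: float arrays of shape (H, W, 3)
    object_mask: boolean array of shape (H, W), True where the object is
    """
    mask = object_mask.astype(image.dtype)[..., None]  # broadcast over channels
    return mask * image + (1.0 - mask) * new_background


def accuracy_drop(correct_original, correct_modified):
    """Difference in accuracy between original and background-modified images.

    Hypothetical metric helper; inputs are per-image 0/1 correctness flags.
    """
    return float(np.mean(correct_original) - np.mean(correct_modified))


# Toy data (assumed for illustration): a 4x4 image with a 2x2 "object" in the center.
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
object_mask = np.zeros((4, 4), dtype=bool)
object_mask[1:3, 1:3] = True

# A solid red background stands in for a color change in the background.
new_background = np.zeros((4, 4, 3))
new_background[..., 0] = 1.0

composed = compose_with_background(image, object_mask, new_background)

# Per-image correctness of some classifier before and after the change (made-up flags).
drop = accuracy_drop([1, 1, 1, 0], [1, 0, 0, 0])
```

In this sketch the object region of `composed` is identical to the original image, so any change in a model's prediction can be attributed to the background alone, which is the kind of measurement `accuracy_drop` summarizes.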

