In prediction tasks, some features are related to the label in the same way across different settings for that task; these are semantic features, or semantics. Features whose relationship to the label varies across settings are nuisances. For example, in detecting cows in natural images, the shape of the head is a semantic, whereas the background is a nuisance: images of cows often, but not always, have grass backgrounds. Because nuisance-label relationships are unstable across settings, models that exploit them suffer performance degradation when these relationships change. Direct knowledge of a nuisance helps build models that are robust to such changes, but it requires extra annotations beyond the labels and covariates. In this paper, we develop an alternative way to produce robust models through data augmentation. These augmentations, which we call semantic corruptions, corrupt semantic information to produce models that identify and adjust for where nuisances drive predictions. We study how semantic corruptions power different spurious-correlation-avoiding methods on multiple out-of-distribution (OOD) tasks, including classifying waterbirds, natural language inference (NLI), and detecting cardiomegaly in chest X-rays.
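To make the idea of corrupting semantic information concrete, the following is a minimal sketch of one possible corruption for images: shuffling non-overlapping patches. The abstract does not say which corruptions are used, so the choice of patch shuffling, the function name `patch_randomize`, and the `patch_size` parameter are all assumptions made for illustration only.

```python
import numpy as np


def patch_randomize(image, patch_size, rng):
    """Shuffle non-overlapping square patches of an H x W x C image.

    Hypothetical helper for illustration; not taken from the paper.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size

    # Cut the image into a stack of (rows * cols) patches.
    patches = (
        image.reshape(rows, patch_size, cols, patch_size, c)
        .swapaxes(1, 2)
        .reshape(rows * cols, patch_size, patch_size, c)
    )

    # Randomly permute the patch positions.
    shuffled = patches[rng.permutation(rows * cols)]

    # Stitch the permuted patches back into an H x W x C image.
    return (
        shuffled.reshape(rows, cols, patch_size, patch_size, c)
        .swapaxes(1, 2)
        .reshape(h, w, c)
    )


# Usage on a small synthetic image (values are placeholders, not real data).
rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)
corrupted = patch_randomize(image, patch_size=4, rng=rng)
```

In the cow example, a corruption of this kind would tend to break up global structure such as the shape of the head while leaving local background texture such as grass largely intact. A model trained to predict the label from corrupted inputs could then plausibly serve as a stand-in for a nuisance-only predictor, which is the role the abstract assigns to semantic corruptions.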