Novelty detection aims to find samples that differ in some way from the distribution of seen samples. But not all changes are created equal. Data can undergo a multitude of distribution shifts, and we might want to detect only some types of relevant changes. Similar to work on out-of-distribution generalization, we propose to separate shifts into semantic (content) changes, which are relevant to our task, and style changes, which are irrelevant. Within this formalization, we define robust novelty detection as the task of finding semantic changes while being robust to style distributional shifts. Leveraging pretrained, large-scale model representations, we introduce Stylist, a novel method that focuses on dropping environment-biased features. First, we compute a per-feature score based on the feature distribution distances between environments. Next, we show that selecting features by this score removes those responsible for spurious correlations and improves novelty detection performance. For evaluation, we adapt domain generalization datasets to our task and analyze the methods' behavior. We additionally build a large synthetic dataset in which we control the degree of spurious correlation. We show that our selection mechanism improves novelty detection algorithms across multiple datasets containing both stylistic and content shifts.
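To make the scoring step concrete, the following is a minimal sketch of a per-feature, between-environment score followed by feature dropping. It is an illustration under stated assumptions, not the Stylist implementation: the abstract does not name the distance, so the 1D Wasserstein distance is assumed here, and the function names `wasserstein_1d`, `feature_scores`, and `select_features`, as well as the `n_drop` parameter, are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations


def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between two empirical samples.

    Computed as the integral of |F_a(x) - F_b(x)| over x, where F_a and F_b
    are the empirical CDFs. The choice of distance is an assumption.
    """
    a, b = sorted(a), sorted(b)
    points = sorted(a + b)
    dist = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        dist += abs(cdf_a - cdf_b) * (right - left)
    return dist


def feature_scores(envs):
    """Score each feature by its mean pairwise distance across environments.

    envs: list of environments (at least two), each a list of equal-length
    feature vectors, e.g. embeddings from a pretrained model.
    A high score suggests the feature is environment-biased.
    """
    n_features = len(envs[0][0])
    pairs = list(combinations(range(len(envs)), 2))
    scores = []
    for j in range(n_features):
        # Column j of every environment.
        cols = [[row[j] for row in env] for env in envs]
        total = sum(wasserstein_1d(cols[p], cols[q]) for p, q in pairs)
        scores.append(total / len(pairs))
    return scores


def select_features(scores, n_drop):
    """Return indices of the features kept after dropping the n_drop
    highest-scoring (most environment-dependent) ones."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j])
    return sorted(ranked[: len(scores) - n_drop])
```

A downstream novelty detector would then be fit only on the kept feature columns. How many features to drop is a hyperparameter in this sketch; the abstract does not specify how it is chosen.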