We tackle the problem of robust novelty detection, where the goal is to detect novelties in semantic content while remaining invariant to changes in other, irrelevant factors. Specifically, we operate in a multi-environment setup and identify the set of features that are associated more with the environments than with the content relevant to the task. To this end, we propose a method that starts from a pretrained embedding and ranks its features by how strongly they focus on the environment. First, we compute a per-feature score based on the variance of the feature's distribution across environments. Next, we show that dropping the highest-scoring features removes spurious correlations and improves overall performance by up to 6%, under both covariate and sub-population shift, on both a real and a synthetic benchmark that we introduce for this task.
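The scoring and pruning steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the exact statistic, so the score here is assumed to be the variance of per-environment means of each standardized feature, and the function names `environment_scores` and `drop_environment_features`, the `drop_fraction` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def environment_scores(features, env_ids):
    """Score each feature by how much it shifts across environments.

    Assumption: the score is the variance of the per-environment means of
    each standardized feature. The abstract does not fix this statistic.
    """
    features = np.asarray(features, dtype=float)
    env_ids = np.asarray(env_ids)
    # Standardize so that features on different scales are comparable.
    mu = features.mean(axis=0)
    sigma = features.std(axis=0) + 1e-8
    z = (features - mu) / sigma
    # Per-environment mean of every feature: shape (n_envs, n_features).
    env_means = np.stack(
        [z[env_ids == e].mean(axis=0) for e in np.unique(env_ids)]
    )
    return env_means.var(axis=0)


def drop_environment_features(features, scores, drop_fraction=0.25):
    """Drop the highest-scoring features and return the rest with their indices."""
    n_features = scores.shape[0]
    n_drop = int(round(drop_fraction * n_features))
    # argsort is ascending, so the first entries are the least environment-focused.
    keep = np.sort(np.argsort(scores)[: n_features - n_drop])
    return np.asarray(features)[:, keep], keep


# Toy data (hypothetical): 3 environments, 4 features, feature 0 tracks the environment.
rng = np.random.default_rng(0)
env_ids = np.repeat([0, 1, 2], 200)
X = rng.normal(size=(600, 4))
X[:, 0] += 3.0 * env_ids

scores = environment_scores(X, env_ids)
X_kept, keep = drop_environment_features(X, scores, drop_fraction=0.25)
```

In this toy setting the environment-dependent feature receives the largest score and is the one removed. A downstream novelty detector would then be fit on `X_kept` rather than on the full embedding.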