We tackle the problem of robust novelty detection, where we aim to detect novelties in terms of semantic content while remaining invariant to changes in other, irrelevant factors. Specifically, we operate in a setup with multiple environments and determine the set of features that are associated more with the environments than with the content relevant to the task. To this end, we propose a method that starts from a pretrained embedding and a multi-environment setup and ranks the features by how strongly they focus on the environment. First, we compute a per-feature score based on the variance of the feature's distribution across environments. Next, we show that dropping the highest-scoring features removes spurious correlations and improves overall performance by up to 6%, in both covariate-shift and sub-population-shift cases, on both a real and a synthetic benchmark that we introduce for this task.
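The two steps described above (score each feature by how much it varies across environments, then drop the highest-scoring ones) can be illustrated with a minimal sketch. The abstract does not specify the exact score, so this example assumes a simple proxy: the variance, across environments, of each standardized feature's per-environment mean. The function names `environment_scores` and `drop_top_features`, the parameter `n_drop`, and the toy data are all hypothetical and are not taken from the paper.

```python
import numpy as np


def environment_scores(embeddings, env_ids):
    """Score each feature by how much its per-environment mean varies.

    Assumed proxy for the paper's score, not the authors' exact formula.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    env_ids = np.asarray(env_ids)
    # Standardize each feature over the pooled data so scores are comparable.
    std = embeddings.std(axis=0)
    z = (embeddings - embeddings.mean(axis=0)) / np.where(std > 0, std, 1.0)
    # One row of per-feature means for every environment.
    env_means = np.stack(
        [z[env_ids == e].mean(axis=0) for e in np.unique(env_ids)]
    )
    # Variance of those means across environments is the per-feature score.
    return env_means.var(axis=0)


def drop_top_features(embeddings, scores, n_drop):
    """Remove the n_drop features with the highest environment score."""
    keep = np.sort(np.argsort(scores)[: len(scores) - n_drop])
    return np.asarray(embeddings)[:, keep], keep


# Toy data (hypothetical): 3 environments, 5 features, and only
# feature 0 shifts with the environment.
rng = np.random.default_rng(0)
env_ids = np.repeat([0, 1, 2], 200)
X = rng.normal(size=(600, 5))
X[:, 0] += 3.0 * env_ids

scores = environment_scores(X, env_ids)
X_kept, kept = drop_top_features(X, scores, n_drop=1)
```

In this toy setting the environment-dependent feature receives the largest score and is the one removed. A downstream novelty detector would then operate on `X_kept` instead of the full embedding. How many features to drop is a tuning choice that this sketch leaves open.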