We consider the problem of zero-shot one-class visual classification, which extends traditional one-class classification to scenarios where only the label of the target class is available. The goal is to discriminate between positive and negative query samples without any examples from the target class. We propose a two-step solution that first queries large language models for visually confusing objects and then relies on vision-language pre-trained models (e.g., CLIP) to perform classification. By adapting large-scale vision benchmarks, we show that the proposed method outperforms adapted off-the-shelf alternatives in this setting. Specifically, we propose a realistic benchmark where negative query samples are drawn from the same original dataset as positive ones, including a granularity-controlled version of iNaturalist, where negative samples are at a fixed distance in the taxonomy tree from the positive ones. To our knowledge, we are the first to demonstrate the ability to discriminate a single category from other semantically related ones using only its label.
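The two-step idea can be illustrated with a minimal sketch. It is not the paper's implementation: the decision rule (accept a query only if the target label scores higher than every confusing label), the function names, and the toy 2-D embeddings are all assumptions made for illustration. In practice the confusing labels would come from a large language model, and the embeddings from the text and image encoders of a vision-language model such as CLIP.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_positive(
    image_emb: Sequence[float],
    target_emb: Sequence[float],
    confuser_embs: Sequence[Sequence[float]],
) -> bool:
    """Hypothetical decision rule (an assumption, not taken from the paper):
    accept the query image only if it is closer to the target label than
    to every visually confusing label."""
    target_score = cosine(image_emb, target_emb)
    return all(target_score > cosine(image_emb, c) for c in confuser_embs)


# Step 1 (stubbed): labels an LLM might return as visually confusing
# with the target "husky". These names and vectors are made up.
target_emb = [1.0, 0.0]             # stand-in text embedding for "husky"
confuser_embs = [
    [0.0, 1.0],                     # stand-in text embedding for "wolf"
    [0.7, 0.7],                     # stand-in text embedding for "malamute"
]

# Step 2: compare stand-in image embeddings against all label embeddings.
query_a = [0.9, 0.1]                # lies close to the target direction
query_b = [0.2, 0.9]                # lies close to the "wolf" direction

result_a = is_positive(query_a, target_emb, confuser_embs)
result_b = is_positive(query_b, target_emb, confuser_embs)
```

Without the confusing labels there would be nothing to compare the target score against, which is why the first step matters: the negatives supplied by the language model turn a one-class problem into a comparison that a zero-shot classifier can make.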