Effective OOD detection hinges on two aspects: generalized feature representations and precise category descriptions. Recently, vision-language models such as CLIP have brought significant advances on both fronts, but constructing precise category descriptions remains in its infancy due to the absence of unseen categories. This work introduces two hierarchical contexts, namely a perceptual context and a spurious context, to carefully describe the precise category boundary through automatic prompt tuning. Specifically, perceptual contexts perceive inter-category differences (e.g., cats vs. apples) for the current classification task, while spurious contexts further identify spurious (similar but not identical) OOD samples for every single category (e.g., cats vs. panthers, apples vs. peaches). The two contexts hierarchically construct a precise description for each category: a sample is first roughly classified to a predicted category and then delicately examined to determine whether it is truly an ID sample or actually OOD. Moreover, the precise per-category descriptions within the vision-language framework enable a novel application: CATegory-EXtensible OOD detection (CATEX). One can efficiently extend the set of recognizable categories by simply merging the hierarchical contexts learned under different sub-task settings. Extensive experiments demonstrate CATEX's effectiveness, robustness, and category extensibility. For instance, CATEX consistently surpasses its rivals by a large margin under several protocols on the challenging ImageNet-1K dataset. In addition, we offer new insights on how to efficiently scale up prompt engineering in vision-language models to recognize thousands of object categories, as well as how to incorporate large language models (such as GPT-3) to boost zero-shot applications. Code is publicly available at https://github.com/alibaba/catex.
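The hierarchical decision rule described above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's actual implementation: it assumes each category comes with one perceptual-context text embedding and one spurious-context text embedding (e.g., produced by a CLIP-like text encoder after prompt tuning), and that an image embedding is compared to both by cosine similarity. All function and variable names are illustrative.

```python
from math import sqrt

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return _dot(a, b) / (sqrt(_dot(a, a)) * sqrt(_dot(b, b)))

def hierarchical_ood_decision(image_feat, perceptual_protos, spurious_protos):
    """Toy two-stage decision, loosely following the abstract's description.

    image_feat:        image embedding from a CLIP-like encoder (list of floats)
    perceptual_protos: one perceptual-context text embedding per category
    spurious_protos:   one spurious-context text embedding per category

    Returns (predicted_category, is_id): the sample is first roughly
    classified against the perceptual contexts, then judged ID only if it is
    closer to the predicted category's perceptual prompt than to that
    category's spurious (similar-but-OOD) prompt.
    """
    # Stage 1: rough classification against perceptual contexts.
    sims = [_cosine(image_feat, p) for p in perceptual_protos]
    k = max(range(len(sims)), key=sims.__getitem__)

    # Stage 2: delicate ID-vs-OOD check for the predicted category.
    is_id = sims[k] > _cosine(image_feat, spurious_protos[k])
    return k, is_id
```

Under this sketch, extending the recognizable category set (as in CATEX) amounts to concatenating the per-category prototype lists learned under different sub-task settings, since each category carries its own self-contained pair of contexts.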