Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts: they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and on the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors used for unsupervised part discovery.
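To make the geometric prior concrete, the following is a minimal NumPy sketch of an anisotropic total variation penalty applied to soft part-assignment maps. The function name, array shapes, and the anisotropic (L1, 4-neighbour) variant are illustrative assumptions, not the paper's actual implementation: unlike a compactness prior, this penalty only discourages noisy boundaries, so it permits parts of any size with multiple connected components.

```python
import numpy as np

def tv_prior(maps: np.ndarray) -> float:
    """Anisotropic total variation of soft part-assignment maps.

    maps: array of shape (K, H, W), one assignment map per part
          (e.g. a softmax over K parts at each spatial location).
    Returns the summed absolute difference between vertically and
    horizontally adjacent values. Piecewise-constant maps score low
    regardless of region size or number of connected components.
    """
    # Differences between vertical neighbours (along H).
    dh = np.abs(maps[:, 1:, :] - maps[:, :-1, :]).sum()
    # Differences between horizontal neighbours (along W).
    dw = np.abs(maps[:, :, 1:] - maps[:, :, :-1]).sum()
    return float(dh + dw)
```

A uniform map incurs zero penalty, and a map split into two flat halves is penalised only along the single boundary between them, however large each half is; this is the sense in which TV relaxes the small-and-compact assumption.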