We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability. DNC instead conducts nonparametric, case-based reasoning: it uses sub-centroids of training samples to describe class distributions, and it explains each classification as the proximity of the test data to the class sub-centroids in the feature space. Owing to this distance-based nature, the network output dimensionality is flexible, and all learnable parameters are used only for data embedding. This means that all the knowledge learnt for ImageNet classification can be completely transferred to pixel recognition learning under the "pre-training and fine-tuning" paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes), with improved transparency and fewer learnable parameters, using various network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). We believe this work brings fundamental insights to related fields.
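The decision rule described above, assigning a test sample to the class of its closest sub-centroid in feature space, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `nearest_subcentroid_predict`, the use of squared Euclidean distance, and the hand-picked toy sub-centroids are all assumptions made for illustration, and the sketch omits the embedding network and however the sub-centroids are actually estimated.

```python
import numpy as np


def nearest_subcentroid_predict(features, sub_centroids):
    """Hypothetical helper: classify by the closest class sub-centroid.

    features:      (N, D) array of embedded test samples.
    sub_centroids: (C, K, D) array, K sub-centroids for each of C classes.
    Returns an (N,) array of predicted class indices.
    """
    # Broadcast to pairwise differences of shape (N, C, K, D).
    diff = features[:, None, None, :] - sub_centroids[None, :, :, :]
    # Squared Euclidean distance (an assumed metric) -> shape (N, C, K).
    dist = np.sum(diff ** 2, axis=-1)
    # A class is as close as its nearest sub-centroid -> shape (N, C).
    class_dist = dist.min(axis=-1)
    # Predict the class with the smallest distance.
    return class_dist.argmin(axis=-1)


# Toy 2-D feature space: two classes, two sub-centroids each (made-up values).
sub_centroids = np.array([
    [[0.0, 0.0], [0.0, 4.0]],  # class 0
    [[4.0, 0.0], [4.0, 4.0]],  # class 1
])
features = np.array([[0.5, 3.5], [3.8, 0.2], [1.0, 0.0]])

predictions = nearest_subcentroid_predict(features, sub_centroids)
print(predictions)
```

Because the classifier holds no weights of its own in this sketch, changing the number of classes only changes the shape of `sub_centroids`, which is one way to read the claim that the output dimensionality is flexible.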