Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. First, a learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. Second, a set of part queries cross-attends to patch features, and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. Third, a learnable two-dimensional Gaussian prior, added to the attention logits in log space, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, so no per-image keypoint supervision is needed at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline in classification (88.85% versus 88.95% top-1) while raising pointing accuracy by about 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image spatial supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
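The log-space injection of the Gaussian prior can be illustrated with a minimal sketch. It is not the paper's implementation: the axis-aligned Gaussian, the way the foreground gate enters as a log term, the `eps` constant, and all function names (`gaussian_log_prior`, `part_attention`) are assumptions made for illustration. The point it shows is that adding a log prior to the logits is equivalent to multiplying the softmax weights by the prior, so an otherwise uninformative part query is pulled toward its prior mean and away from gated-out background patches.

```python
import math

def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Unnormalized log of an axis-aligned 2D Gaussian at each patch centre.

    mean and sigma are (x, y) pairs in normalized [0, 1] image coordinates
    (an assumed parameterization). Patches are listed in row-major order.
    """
    mx, my = mean
    sx, sy = sigma
    log_prior = []
    for r in range(grid_h):
        for c in range(grid_w):
            x = (c + 0.5) / grid_w  # patch-centre x
            y = (r + 0.5) / grid_h  # patch-centre y
            log_prior.append(-0.5 * (((x - mx) / sx) ** 2 + ((y - my) / sy) ** 2))
    return log_prior

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def part_attention(content_logits, log_prior, fg_gate, eps=1e-6):
    """Attention of one part query over all patches.

    content_logits: query-key scores, one per patch.
    log_prior:      Gaussian log prior, added to the logits.
    fg_gate:        foreground probabilities in [0, 1]; their log is also
                    added, so background patches are suppressed (assumed form).
    """
    logits = [s + p + math.log(g + eps)
              for s, p, g in zip(content_logits, log_prior, fg_gate)]
    return softmax(logits)

# Usage: a 4x4 patch grid, a flat (uninformative) content score, and a prior
# centred on the patch at row 2, column 2 (row-major index 10).
H = W = 4
prior = gaussian_log_prior(H, W, mean=(0.625, 0.625), sigma=(0.2, 0.2))
flat = [0.0] * (H * W)
attn = part_attention(flat, prior, fg_gate=[1.0] * (H * W))
peak = max(range(H * W), key=lambda i: attn[i])

# Marking the same patch as background moves the attention to its neighbours.
gate = [1.0] * (H * W)
gate[10] = 0.0
attn_gated = part_attention(flat, prior, gate)
```

Because the prior only biases the logits, a sufficiently strong content score can still override it; making the mean and scale learnable, as the abstract describes, lets training adjust how far each part may drift from its initial location.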
