Modern image classifiers widely adopt global average pooling (GAP) followed by a linear classification head. This linearity ensures that the image-level logits equal the average of logits obtained by applying the classification head pointwise to the feature grid prior to GAP. Consequently, standard classifiers may inherently retain spatial class evidence that remains recoverable even when the image-level prediction is incorrect. This structure naturally suggests a multiple-instance learning (MIL) interpretation, where an image is viewed as a bag of spatial instances. Within this formulation, we demonstrate that standard classifiers trained with a single label per image can still learn the intended classification task in multi-object scenes. We further exploit this property to decompose image-level logits into a prediction grid, providing a post-hoc diagnostic to extract spatial class evidence that GAP otherwise obscures. Our systematic evaluation reveals that off-the-shelf models consistently recover the ground-truth class within foreground regions. The MIL interpretation further suggests that common classifier failures reflect known limitations of mean aggregation.
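The linearity argument above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names, the flattened 7x7 grid, the channel and class counts, and the random features and weights are all illustrative assumptions. It compares the standard path (GAP, then the linear head) with the pointwise path (the head applied at every location, then averaged).

```python
import random


def linear_head(feature, weights, bias):
    """Apply a linear classification head to a single feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(weights, bias)]


def mean_vectors(vectors):
    """Average a list of equal-length vectors elementwise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def gap_logits(feature_grid, weights, bias):
    """Standard path: global average pooling, then the linear head."""
    return linear_head(mean_vectors(feature_grid), weights, bias)


def prediction_grid(feature_grid, weights, bias):
    """Pointwise path: per-location logits, one vector per grid cell."""
    return [linear_head(feature, weights, bias) for feature in feature_grid]


# Hypothetical sizes: a 7x7 feature grid (flattened), 8 channels, 3 classes.
rng = random.Random(0)
num_locations, channels, num_classes = 49, 8, 3
feature_grid = [[rng.gauss(0, 1) for _ in range(channels)]
                for _ in range(num_locations)]
weights = [[rng.gauss(0, 1) for _ in range(channels)]
           for _ in range(num_classes)]
bias = [rng.gauss(0, 1) for _ in range(num_classes)]

image_logits = gap_logits(feature_grid, weights, bias)
grid = prediction_grid(feature_grid, weights, bias)
mean_of_grid = mean_vectors(grid)

# The two paths should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(image_logits, mean_of_grid))

# Per-location predicted class: the spatial evidence that averaging hides.
location_classes = [max(range(num_classes), key=lambda c: logits[c])
                    for logits in grid]
```

Because the bias is added at every location and then averaged, it contributes the same amount in both paths, so the identity holds with or without a bias term. In this sketch `location_classes` plays the role of the prediction grid: individual locations may favor a class different from the image-level argmax.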