Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies leverage the long-range dependency modeling of self-attention in the vision Transformer to reactivate semantic regions, aiming to avoid the partial activation of traditional class activation mapping (CAM). However, the long-range modeling in the Transformer neglects the inherent spatial coherence of the object and usually diffuses the semantic-aware regions far from the object boundary, making the localization results significantly larger or far smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, which incorporates the semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically adjust the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. This enables the generated attention maps to capture sharper object boundaries and filter out object-irrelevant background areas. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both the CUB-200 and ImageNet-1K benchmarks. The code is available at https://github.com/164140757/SCM.
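
To make the idea of diffusing an activation map over patch tokens more concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the linked repository for that). Everything in it is an assumption made for illustration: the function names `grid_adjacency` and `calibrate`, the 4-neighbour spatial prior, the cosine-similarity semantic term, the way the two are combined, and the closed-form diffusion solve. The scalar `lam` merely stands in for the learnable parameter mentioned in the abstract and is a fixed float here.

```python
import numpy as np


def grid_adjacency(h, w):
    """Binary 4-neighbour adjacency for an h x w patch grid (assumed spatial prior)."""
    n = h * w
    adj = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            if c + 1 < w:  # right neighbour
                adj[i, i + 1] = adj[i + 1, i] = 1.0
            if r + 1 < h:  # bottom neighbour
                adj[i, i + w] = adj[i + w, i] = 1.0
    return adj


def calibrate(attn, tokens, h, w, lam=1.0, alpha=0.5):
    """Diffuse a flattened activation map over a semantic-spatial patch graph.

    attn:   (h*w,) raw activation per patch
    tokens: (h*w, d) patch token embeddings
    lam:    hypothetical stand-in for the learnable balance parameter (>= 0)
    alpha:  diffusion strength, assumed to lie in [0, 1)
    """
    n = h * w
    # Semantic term: non-negative cosine similarity between patch tokens.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sem = np.maximum(unit @ unit.T, 0.0)
    # Spatial term: only neighbouring patches are connected; the edge weight
    # grows with semantic similarity, balanced by lam.
    affinity = grid_adjacency(h, w) * (1.0 + lam * sem)
    # Symmetric degree normalisation of the affinity matrix.
    deg = np.maximum(affinity.sum(axis=1), 1e-8)
    norm = affinity / np.sqrt(np.outer(deg, deg))
    # Closed-form diffusion: solve (I - alpha * norm) x = attn.
    return np.linalg.solve(np.eye(n) - alpha * norm, attn)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w = 4, 4
    tokens = rng.normal(size=(h * w, 8))
    attn = np.zeros(h * w)
    attn[5] = 1.0  # a single activated patch
    print(calibrate(attn, tokens, h, w).reshape(h, w).round(3))
```

In this sketch, activation can only travel along spatially adjacent patches, and it travels further between patches whose tokens are semantically similar. Setting `alpha=0.0` disables the diffusion and returns the input map unchanged.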
