Weakly supervised object localization (WSOL) is a challenging task that aims to localize objects using only image-level supervision. Recent works apply vision transformers to WSOL and achieve significant success by exploiting the long-range feature dependencies captured by the self-attention mechanism. However, existing transformer-based methods synthesize the localization map from the classification feature maps, which leads to optimization conflicts between the classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for the localization task. A spatial-aware attention module is then constructed, which allows the spatial token to generate the foreground probabilities of different patches by querying them and to extract localization knowledge from the classification task. In addition, to address the sparse and unbalanced pixel-level supervision derived from the image-level label, we design two spatial constraints, a batch area loss and a normalization loss, that compensate for and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of training with only 1 image per class from ImageNet, SAT surpasses the previous state-of-the-art method by 2.1% GT-known Loc. Code and models are available at https://github.com/wpy1999/SAT.
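As a rough illustration of how a spatial token could produce per-patch foreground probabilities by querying, the sketch below scores each patch key against the token's query with a scaled dot product and applies a sigmoid to each score. This is a minimal sketch under stated assumptions, not the released implementation: the function names, the use of a sigmoid, and the `batch_mean_area` statistic (a simple batch-level area quantity of the kind a batch area constraint might act on) are all hypothetical and are not taken from the abstract.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def foreground_probabilities(spatial_query, patch_keys):
    """Hypothetical sketch: score every patch key against the spatial
    token's query (scaled dot product), then squash each score
    independently into a foreground probability."""
    scale = math.sqrt(len(spatial_query))
    probs = []
    for key in patch_keys:
        score = sum(q * k for q, k in zip(spatial_query, key)) / scale
        probs.append(sigmoid(score))
    return probs


def batch_mean_area(batch_probs):
    """Hypothetical batch-level statistic: the average foreground
    probability over every patch of every image in the batch."""
    total = sum(sum(probs) for probs in batch_probs)
    count = sum(len(probs) for probs in batch_probs)
    return total / count


# Toy example: one query vector and three patch keys (all values made up).
query = [2.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.0]]
probs = foreground_probabilities(query, keys)
area = batch_mean_area([probs, [0.5, 0.5, 0.5]])
```

In this toy setup, a patch whose key aligns with the query receives a probability above 0.5, a patch whose key opposes it receives one below 0.5, and a zero key yields exactly 0.5. Because each patch is scored independently rather than through a softmax over patches, the probabilities do not compete with one another, which is one plausible reason to prefer a sigmoid for a foreground mask.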