Spatial-Aware Token for Weakly Supervised Object Localization

Weakly supervised object localization (WSOL) is a challenging task that aims to localize objects using only image-level supervision. Recent works apply vision transformers to WSOL and achieve significant success by exploiting the long-range feature dependency in the self-attention mechanism. However, existing transformer-based methods derive the localization map from the classification feature maps, which leads to optimization conflicts between the classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for the localization task. Then a spatial-aware attention module is constructed, which allows the spatial token to generate foreground probabilities for different patches by querying them and to extract localization knowledge from the classification task. In addition, because the pixel-level supervision obtained from the image-level label is sparse and unbalanced, two spatial constraints, a batch area loss and a normalization loss, are designed to compensate for and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of using only 1 image per class from ImageNet for training, SAT already exceeds the SOTA method by 2.1% GT-known Loc. Code and models are available at https://github.com/wpy1999/SAT.
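
To make the querying step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the spatial token's query is compared with each patch key by a scaled dot product and passed through a sigmoid to give a foreground probability, and that the batch area loss is simply the mean foreground probability over the batch. Both forms, and the names `foreground_probabilities` and `batch_area_loss`, are illustrative assumptions; the abstract does not specify them, and the normalization loss is omitted for the same reason.

```python
import math


def foreground_probabilities(spatial_query, patch_keys):
    """Score each patch against one spatial-token query.

    Assumed form: sigmoid of the scaled dot product between the query
    vector and each patch key vector.
    """
    scale = math.sqrt(len(spatial_query))
    probs = []
    for key in patch_keys:
        logit = sum(q * k for q, k in zip(spatial_query, key)) / scale
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def batch_area_loss(batch_probs):
    """Assumed form: mean foreground probability over all patches in the batch.

    Minimizing it discourages the trivial solution of marking every
    patch as foreground.
    """
    total = sum(sum(probs) for probs in batch_probs)
    count = sum(len(probs) for probs in batch_probs)
    return total / count


# Toy example: one 2-dimensional query and three patch keys.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]]
probs = foreground_probabilities(query, keys)
loss = batch_area_loss([probs])
```

In this toy example the first patch aligns with the query and gets a probability above 0.5, the second points the opposite way and falls below 0.5, and the third is orthogonal and sits at exactly 0.5.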

