Recent work in self-supervised learning (SSL) has shown impressive results on single-object images but struggles on complex multi-object images, as evidenced by poor visual grounding. To demonstrate this concretely, we propose visual difference attention (VDA), which computes visual attention maps in an unsupervised fashion by comparing an image with a version of it in which the salient regions are masked out. We use VDA to derive attention maps for state-of-the-art SSL methods and show that they do not accurately highlight all salient regions in an image, suggesting an inability to learn strong representations for downstream tasks such as segmentation. Motivated by these limitations, we cast VDA as a differentiable operation and propose a new learning objective, the Differentiable Difference Attention (DiDA) loss, which leads to substantial improvements in an SSL model's visual grounding to an image's salient regions.
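
To make the idea of comparing an image with its masked version concrete, the following is a minimal sketch, not the exact formulation used in this work. It assumes that an encoder has already produced feature maps of shape (C, H, W) for both the image and its salient-regions-masked-out version. The function name `difference_attention`, the channel-weighting scheme, and the toy inputs are all illustrative assumptions.

```python
import numpy as np


def difference_attention(feats, feats_masked, eps=1e-8):
    """Hypothetical difference-based attention map (illustrative only).

    feats, feats_masked: arrays of shape (C, H, W) holding encoder features
    for the original image and for its masked version.
    Returns an (H, W) map scaled to [0, 1].
    """
    # Global-average-pool each feature map into a C-dimensional descriptor.
    v = feats.mean(axis=(1, 2))
    v_masked = feats_masked.mean(axis=(1, 2))

    # Assumption: channels whose response changes most when the salient
    # regions are masked out are the ones that encode those regions.
    weights = np.abs(v - v_masked)

    # Weighted sum of the original feature maps over the channel axis.
    attn = np.tensordot(weights, feats, axes=([0], [0]))  # shape (H, W)

    # Keep positive evidence only, then min-max normalize.
    attn = np.maximum(attn, 0.0)
    return (attn - attn.min()) / (attn.max() - attn.min() + eps)


# Toy example: channel 0 fires on a 2x2 "object", channel 1 on everything.
feats = np.zeros((2, 4, 4))
feats[0, :2, :2] = 1.0
feats[1] = 1.0

# Masking the object silences channel 0 and leaves channel 1 unchanged.
feats_masked = feats.copy()
feats_masked[0] = 0.0

attn = difference_attention(feats, feats_masked)
```

In this toy case the map is high on the 2x2 object and zero elsewhere. Every step (mean, absolute value, weighted sum, rectification, normalization) is differentiable almost everywhere, which is the kind of property that allows such a map to be used inside a training objective.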