This work addresses weakly supervised object localization: learning to localize objects using only image-level class labels, which are much easier to obtain than bounding-box annotations. The task matters because it reduces the need for labor-intensive ground-truth annotation. However, localization methods trained with weak supervision often suffer from limited localization accuracy. To improve accuracy, we propose a multiscale object localization transformer (MOLT), which comprises multiple object localization transformers that extract patch embeddings at several scales. We also introduce a deep clustering-guided refinement method that further improves localization accuracy by using separately extracted image segments, obtained by clustering pixels with convolutional neural networks. Finally, we demonstrate the effectiveness of the proposed method through experiments on the publicly available ILSVRC-2012 dataset.
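As a rough illustration of how separately extracted image segments could refine a coarse localization map, the following is a minimal Python sketch. The abstract does not specify the refinement rule, so everything here is an assumption made for illustration: the rule itself (keep a segment when its mean activation reaches a threshold), the hypothetical function names `refine_with_segments` and `bounding_box`, the `threshold` parameter, and the toy inputs. It is not the method proposed in this work.

```python
from collections import defaultdict


def refine_with_segments(loc_map, segments, threshold=0.5):
    """Snap a coarse localization map to segment boundaries (assumed rule).

    loc_map:  2D list of floats in [0, 1], a coarse object-evidence map.
    segments: 2D list of ints with the same shape, one segment id per pixel.
    Returns a 2D list of 0/1 values: 1 where the pixel's segment has a mean
    activation of at least `threshold`.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for map_row, seg_row in zip(loc_map, segments):
        for value, seg_id in zip(map_row, seg_row):
            totals[seg_id] += value
            counts[seg_id] += 1
    # Keep whole segments, so the mask follows segment boundaries.
    keep = {s for s in totals if totals[s] / counts[s] >= threshold}
    return [[1 if s in keep else 0 for s in seg_row] for seg_row in segments]


def bounding_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around nonzero pixels, or None."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy inputs (hypothetical): a 4x4 evidence map and a two-segment label map.
loc_map = [
    [0.1, 0.2, 0.1, 0.0],
    [0.1, 0.9, 0.8, 0.1],
    [0.0, 0.7, 0.3, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
segments = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

mask = refine_with_segments(loc_map, segments)
print(bounding_box(mask))  # (1, 1, 2, 2)
```

In the toy example, the pixel with activation 0.3 falls below the threshold on its own but is recovered because its segment as a whole has a high mean activation. This is the kind of boundary correction that segment-guided refinement is meant to provide.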