Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity due to their simplicity and adaptability to different settings and conditions. However, these methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Building on this, we propose a masking scheme on the input features that selectively disregards background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend slot attention to a multi-query approach, allowing the model to learn multiple sets of slots and thereby produce more stable masks. During training, these sets of slots are learned independently; at test time, they are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: https://github.com/rishavpramanik/maskedmultiqueryslot
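The test-time merging step can be illustrated with a minimal sketch. It is not taken from the released code and rests on several assumptions: each set of slots is a list of K vectors, the first set serves as the reference, the matching cost is the squared Euclidean distance, and matched slots are averaged. The optimal assignment is found here by exhaustive search over permutations, which yields the same result as Hungarian matching for small K; the function names `best_assignment` and `merge_slot_sets` are hypothetical.

```python
from itertools import permutations


def squared_distance(a, b):
    """Squared Euclidean distance between two slot vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_assignment(reference, candidate):
    """Return perm such that candidate[perm[i]] is matched to reference[i].

    Exhaustive search over all permutations: same optimum as Hungarian
    matching, but only practical for a small number of slots.
    """
    k = len(reference)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum(
            squared_distance(reference[i], candidate[perm[i]]) for i in range(k)
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm


def merge_slot_sets(slot_sets):
    """Align every slot set to the first one, then average matched slots.

    Assumption: averaging is used as the merge rule for illustration only.
    """
    reference = slot_sets[0]
    k, d = len(reference), len(reference[0])
    aligned = [reference]
    for candidate in slot_sets[1:]:
        perm = best_assignment(reference, candidate)
        aligned.append([candidate[j] for j in perm])
    return [
        [sum(s[i][c] for s in aligned) / len(aligned) for c in range(d)]
        for i in range(k)
    ]


# Two toy slot sets whose slots describe the same entities in a different order.
set_a = [[0.0, 0.0], [1.0, 1.0]]
set_b = [[1.2, 0.8], [0.2, -0.2]]
merged = merge_slot_sets([set_a, set_b])
```

Because the independently learned sets carry no shared slot ordering, the matching step is what makes an element-wise merge meaningful. For realistic slot counts, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive search.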