The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In recent progress on unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and has fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and are ambiguous in their notion of an object, especially for complex natural scenes. In this paper, we address these issues by investigating learnable queries as initializations for Slot-Attention learning and combining them with existing efforts to improve Slot-Attention learning through bi-level optimization. With simple code adjustments to Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablation studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning. Our work is made publicly available at https://bo-qsa.github.io