Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fail to adequately capture the spatial symmetries present in the visual world, which leads to sample inefficiency, for example when object appearance and pose become entangled. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanisms of Slot Attention by translating, scaling, and rotating position encodings. These changes add little computational overhead, are easy to implement, and can yield large gains in data efficiency and overall improvements in object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely CLEVR, Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
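To make the idea of a slot-centric reference frame concrete, the following is a minimal sketch of how absolute position encodings might be translated, rotated, and scaled into one slot's frame. It is an illustration under stated assumptions, not the implementation described in the paper: the function names `build_grid` and `to_slot_frame`, the coordinate range, and the order of operations (translate, then undo rotation, then divide by scale) are all hypothetical choices made for this example.

```python
import numpy as np


def build_grid(height, width):
    """Absolute pixel coordinates in [-1, 1], returned with shape (height * width, 2)."""
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)


def to_slot_frame(points, position, scale, angle):
    """Express absolute 2D points relative to a slot's pose (assumed parameterization).

    points:   (N, 2) absolute coordinates
    position: (2,) slot center
    scale:    scalar or (2,) per-axis slot scale
    angle:    slot rotation in radians
    """
    c, s = np.cos(angle), np.sin(angle)
    # Inverse rotation matrix R(-angle); it undoes the slot's rotation.
    rot_inv = np.array([[c, s], [-s, c]])
    centered = points - position          # translate to the slot center
    unrotated = centered @ rot_inv.T      # rotate into the slot's axes
    return unrotated / scale              # normalize by the slot's scale


# Usage: relative encodings for a 4 x 5 grid and one hypothetical slot pose.
grid = build_grid(4, 5)
rel_grid = to_slot_frame(grid, position=np.array([0.5, -0.25]), scale=0.5, angle=0.3)
```

Under these assumptions, a point that moves together with the slot keeps the same relative coordinates, which is the property that lets appearance be represented independently of pose.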