Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning. Recent slot-based neural networks that learn about objects in a self-supervised manner have made exciting progress in this direction. However, they typically fall short of adequately capturing the spatial symmetries present in the visual world, which leads to sample inefficiency, for example when object appearance and pose become entangled. In this paper, we present a simple yet highly effective method for incorporating spatial symmetries via slot-centric reference frames. We incorporate equivariance to per-object pose transformations into the attention and generation mechanisms of Slot Attention by translating, scaling, and rotating position encodings. These changes add little computational overhead, are easy to implement, and can yield large gains in data efficiency as well as overall improvements in object discovery. We evaluate our method on a wide range of synthetic object discovery benchmarks, namely CLEVR, Tetrominoes, CLEVRTex, Objects Room, and MultiShapeNet, and show promising improvements on the challenging real-world Waymo Open dataset.
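
To make the idea of a slot-centric reference frame concrete, the following is a minimal sketch of how absolute positions might be translated, rotated, and scaled into a per-slot frame before being turned into position encodings. It is an illustration under stated assumptions, not the implementation used in the paper: the function names `to_slot_frame` and `relative_grid`, the pose parameterization (a 2D center, a per-axis scale, and a single rotation angle), and the order of operations are all hypothetical.

```python
import math


def to_slot_frame(point, slot_pos, slot_scale, slot_angle):
    """Map one absolute 2D point into a slot's own reference frame.

    All arguments are hypothetical: slot_pos is the slot center (x, y),
    slot_scale is a positive per-axis extent (sx, sy), and slot_angle is
    the slot's orientation in radians.
    """
    # Translate: move the origin to the slot center.
    dx = point[0] - slot_pos[0]
    dy = point[1] - slot_pos[1]
    # Rotate: apply the inverse rotation R(-angle) to undo the slot's orientation.
    c, s = math.cos(slot_angle), math.sin(slot_angle)
    rx = c * dx + s * dy
    ry = -s * dx + c * dy
    # Scale: normalize by the slot's extent along each axis.
    return (rx / slot_scale[0], ry / slot_scale[1])


def relative_grid(abs_grid, slot_pos, slot_scale, slot_angle):
    """Apply to_slot_frame to every position of an absolute coordinate grid."""
    return [to_slot_frame(p, slot_pos, slot_scale, slot_angle) for p in abs_grid]


def place_in_world(point, slot_pos, slot_scale, slot_angle):
    """Inverse of to_slot_frame: scale, rotate, then translate a canonical point."""
    x, y = point[0] * slot_scale[0], point[1] * slot_scale[1]
    c, s = math.cos(slot_angle), math.sin(slot_angle)
    return (c * x - s * y + slot_pos[0], s * x + c * y + slot_pos[1])


# Usage: the same canonical shape placed under two different poses yields
# the same coordinates once expressed in the matching slot frame.
shape = [(0.5, -0.25), (-1.0, 0.75), (0.0, 0.0)]
pose_a = ((0.3, -0.2), (2.0, 0.5), 0.7)
pose_b = ((-1.5, 4.0), (0.25, 3.0), -2.1)
rel_a = relative_grid([place_in_world(p, *pose_a) for p in shape], *pose_a)
rel_b = relative_grid([place_in_world(p, *pose_b) for p in shape], *pose_b)
```

In this sketch, `rel_a` and `rel_b` agree up to floating-point error even though the two poses differ, which is the sense in which relative position encodings can separate object appearance from pose. How such coordinates are fed into the attention and generation mechanisms of Slot Attention is not specified in this abstract and is left out here.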