SlotGNN: Unsupervised Discovery of Multi-Object Representations and Visual Dynamics

Learning multi-object dynamics from visual data without supervision is challenging because it requires robust object representations that can be learned through robot interactions. This paper presents a novel framework with two new architectures: SlotTransport, which discovers object representations from RGB images, and SlotGNN, which predicts their collective dynamics from RGB images and robot interactions. SlotTransport builds on slot attention for unsupervised object discovery and uses a feature transport mechanism to keep object-centric representations temporally aligned. This enables the discovery of slots that consistently reflect the composition of multi-object scenes. These slots bind robustly to distinct objects, even under heavy occlusion or when objects are absent from the scene. SlotGNN, a novel unsupervised graph-based dynamics model, predicts the future state of multi-object scenes. It learns a graph representation of the scene from the slots discovered by SlotTransport and performs relational and spatial reasoning to predict the future appearance of each slot conditioned on robot actions. We demonstrate that SlotTransport learns object-centric features that accurately encode both visual and positional information. We further highlight the accuracy of SlotGNN in downstream robotic tasks, including challenging multi-object rearrangement and long-horizon prediction. Finally, our unsupervised approach proves effective in the real world: with only minimal additional data, the framework robustly predicts slots and their corresponding dynamics in real-world control tasks.
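
To make the idea of action-conditioned prediction over a slot graph concrete, the following is a minimal sketch of one generic message-passing step. It is not the SlotGNN architecture: the fully connected graph, the residual update, the function names (`linear`, `random_matrix`, `predict_next_slots`), the dimensions, and the random untrained weights are all assumptions made for illustration only.

```python
import math
import random


def linear(weights, vector):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights standing in for learned ones."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def predict_next_slots(slots, action, w_edge, w_node):
    """One round of message passing over a fully connected slot graph.

    slots  : list of slot vectors, one per node (assumed layout)
    action : robot action vector shared by all nodes (assumed encoding)
    """
    dim = len(slots[0])
    next_slots = []
    for i, receiver in enumerate(slots):
        # Relational step: sum the messages sent by every other slot.
        aggregated = [0.0] * dim
        for j, sender in enumerate(slots):
            if i == j:
                continue
            message = [math.tanh(m) for m in linear(w_edge, receiver + sender)]
            aggregated = [a + m for a, m in zip(aggregated, message)]
        # Node update: condition on the slot, its messages, and the action,
        # then apply the result as a residual change to the slot.
        delta = linear(w_node, receiver + aggregated + action)
        next_slots.append([s + d for s, d in zip(receiver, delta)])
    return next_slots


rng = random.Random(0)
num_slots, slot_dim, action_dim = 4, 8, 2
slots = [[rng.gauss(0.0, 1.0) for _ in range(slot_dim)] for _ in range(num_slots)]
action = [0.1, -0.2]
w_edge = random_matrix(slot_dim, 2 * slot_dim, rng)
w_node = random_matrix(slot_dim, 2 * slot_dim + action_dim, rng)

next_slots = predict_next_slots(slots, action, w_edge, w_node)
print(len(next_slots), len(next_slots[0]))
```

In this sketch each slot keeps its identity across the step, so feeding `next_slots` back into `predict_next_slots` with a new action gives a multi-step rollout. A real model would learn the weights and decode the predicted slots back to images.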
