Recent work has shown that object-centric representations can significantly improve the accuracy of learned dynamics models while also adding interpretability. In this work, we take this idea one step further and ask the following question: "can learning a disentangled representation further improve the accuracy of visual dynamics prediction in object-centric models?" While there have been some attempts to learn such disentangled representations for static images \citep{nsb}, to the best of our knowledge, ours is the first work to do so in a general setting for video, without making any specific assumptions about the kinds of attributes an object may have. The key building block of our architecture is the notion of a {\em block}, where several blocks together constitute an object. Each block is represented as a linear combination of a given number of learnable concept vectors, which is iteratively refined during learning. The blocks in our model are discovered in an unsupervised manner, by attending over object masks in a style similar to the discovery of slots \citep{slot_attention}, resulting in a dense object-centric representation. We then apply self-attention via transformers over the discovered blocks to predict the next state, thereby learning the visual dynamics. Experiments on several benchmark 2D and 3D datasets demonstrate that our architecture (1) discovers semantically meaningful blocks, (2) improves the accuracy of dynamics prediction over SOTA object-centric models, and (3) performs significantly better in OOD settings where specific attribute combinations were not seen during training. Our experiments highlight the importance of discovering disentangled representations for visual dynamics prediction.
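As a minimal formalization of the block representation described above (the notation here is ours, not taken from the paper): a block embedding $z_b \in \mathbb{R}^d$ is written as a linear combination of $K$ learnable concept vectors $\{c_k\}_{k=1}^{K}$,
\[
z_b \;=\; \sum_{k=1}^{K} \alpha_{b,k}\, c_k ,
\]
where the coefficients $\alpha_{b,k}$ are produced by the attention step over object masks and iteratively refined during learning. A concrete parameterization of the coefficients, e.g.\ $\alpha_{b} = \operatorname{softmax}\!\big(\langle z_b, c_1\rangle, \dots, \langle z_b, c_K\rangle\big)$ over block--concept similarities, is one plausible instantiation and is stated here only as an assumption.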