World modelling, i.e. building a representation of the rules that govern the world in order to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer architecture to world modelling from video input show notable improvements in sample efficiency. However, existing approaches tend to operate only at the image level, disregarding the fact that the environment is composed of objects interacting with each other. In this paper, we propose an architecture that combines Transformers for world modelling with the slot-attention paradigm, an approach for learning representations of the objects appearing in a scene. We describe the resulting neural architecture and report experimental results showing an improvement over existing solutions in terms of sample efficiency, as well as a reduction in performance variance across training examples. The code for our architecture and experiments is available at https://github.com/torchipeppo/transformers-and-slot-encoding-for-wm
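To make the slot-attention idea concrete, the following is a minimal NumPy sketch of the core iterative update, not the paper's actual implementation: it omits the learned query/key/value projections, GRU, and MLP of the full method, keeping only the defining step in which a fixed set of slot vectors competes for input features via a softmax over the slot axis. All function and parameter names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Simplified slot-attention update.

    inputs: (n_tokens, dim) feature vectors, e.g. from a visual encoder.
    Returns: (num_slots, dim) slot representations.
    """
    n_tokens, dim = inputs.shape
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(num_slots, dim))  # randomly initialized slots
    for _ in range(iters):
        # Dot-product attention logits between slots and inputs.
        logits = slots @ inputs.T / np.sqrt(dim)       # (num_slots, n_tokens)
        # Key difference from standard attention: normalize over SLOTS,
        # so slots compete to explain each input token.
        attn = softmax(logits, axis=0)
        # Then renormalize over inputs to form a weighted mean per slot.
        attn = attn / attn.sum(axis=1, keepdims=True)
        slots = attn @ inputs                          # slot update
    return slots
```

In the architecture described above, slot vectors produced this way would serve as the object-level tokens consumed by the Transformer world model, in place of raw image-level features.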