Agents that can comprehend and reason about object dynamics are expected to be more robust and to generalize better in novel scenarios. Achieving this capability, however, requires not only an effective scene representation but also an understanding of the mechanisms that govern interactions among subsets of objects. Recent studies have made significant progress in representing scenes using object slots. In this work, we introduce Reusable Slotwise Mechanisms (RSM), a framework that models object dynamics through communication among slots and a modular architecture that dynamically selects reusable mechanisms for predicting the future state of each object slot. Crucially, RSM leverages Central Contextual Information (CCI), which lets the selected mechanisms access the remaining slots through a bottleneck and thereby model higher-order, complex interactions that may require a sparse subset of objects. Experimental results show that RSM outperforms state-of-the-art methods across a range of future prediction tasks and related downstream tasks, including Visual Question Answering and action planning. Furthermore, we demonstrate RSM's out-of-distribution generalization to scenes in more intricate scenarios.
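To make the described pipeline concrete, the following is a minimal sketch of one prediction step, not the authors' implementation. It assumes dot-product attention over the remaining slots as a stand-in for the CCI bottleneck, a hard argmax for mechanism selection, and a single tanh layer per mechanism. All names (`init_params`, `predict_next_slots`, `W_q`, `W_k`, `W_sel`, `W_mech`) are hypothetical, and the weights are random rather than learned.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [dot(row, v) for row in W]


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(dim, num_mechanisms, rng):
    """Hypothetical parameters; a real model would learn these."""
    return {
        "W_q": random_matrix(dim, dim, rng),
        "W_k": random_matrix(dim, dim, rng),
        "W_sel": random_matrix(num_mechanisms, 2 * dim, rng),
        "W_mech": [random_matrix(dim, 2 * dim, rng) for _ in range(num_mechanisms)],
    }


def predict_next_slots(slots, params):
    """One illustrative step; needs at least two slots of equal length."""
    dim = len(slots[0])
    next_slots, chosen = [], []
    for i, slot in enumerate(slots):
        others = slots[:i] + slots[i + 1:]
        # Bottleneck (assumed form): compress the remaining slots into one
        # context vector via attention weights computed from the current slot.
        query = matvec(params["W_q"], slot)
        scores = [dot(query, matvec(params["W_k"], o)) / math.sqrt(dim) for o in others]
        weights = softmax(scores)
        context = [sum(w * o[j] for w, o in zip(weights, others)) for j in range(dim)]
        # Selection (assumed form): pick one reusable mechanism per slot.
        features = slot + context  # list concatenation, length 2 * dim
        logits = matvec(params["W_sel"], features)
        m = max(range(len(logits)), key=logits.__getitem__)
        # The selected mechanism predicts a residual update for this slot.
        delta = [math.tanh(x) for x in matvec(params["W_mech"][m], features)]
        next_slots.append([s + d for s, d in zip(slot, delta)])
        chosen.append(m)
    return next_slots, chosen


rng = random.Random(0)
params = init_params(dim=4, num_mechanisms=3, rng=rng)
slots = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
next_slots, chosen = predict_next_slots(slots, params)
```

In this sketch the same small set of mechanisms is shared across all slots, and each mechanism sees the other objects only through the single context vector, which is the property the abstract attributes to the bottleneck.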