The key to high-level cognition is believed to be the ability to systematically manipulate and compose pieces of knowledge. While token-like structured knowledge representations are naturally provided in text, how to obtain them for unstructured modalities such as scene images remains elusive. In this paper, we propose a neural mechanism called the Neural Systematic Binder, or SysBinder, for constructing a novel structured representation called the Block-Slot Representation. In the Block-Slot Representation, object-centric representations known as slots are constructed by composing a set of independent factor representations called blocks, which facilitates systematic generalization. SysBinder obtains this structure in an unsupervised way by alternately applying two different binding principles: spatial binding for spatial modularity across the full scene and factor binding for factor modularity within an object. SysBinder is a simple, deterministic, and general-purpose layer that can be applied as a drop-in module in any neural network and on any modality. In experiments, we find that SysBinder provides significantly better factor disentanglement within the slots than conventional object-centric methods, including, for the first time, in visually complex scene images such as CLEVR-Tex. Furthermore, we demonstrate factor-level systematicity in controlled scene generation by decoding unseen factor combinations.
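To make the block-slot structure concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that spatial binding can be approximated by a slot-attention-style competition among slots over input locations, and that factor binding can be approximated by letting each block attend to its own small memory of prototype vectors. The function names (`spatial_binding`, `factor_binding`, `block_slot_encode`, `swap_block`), the prototype memory, the tensor shapes, and the omission of learned projections and recurrent updates are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_binding(slots, feats):
    """Assumed form: slots compete for input locations, then pool features."""
    # slots: (N, D), feats: (L, D)
    logits = feats @ slots.T / np.sqrt(slots.shape[1])         # (L, N)
    attn = softmax(logits, axis=1)                             # softmax over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)  # normalize over locations
    return weights.T @ feats                                   # (N, D)

def factor_binding(slots, memory):
    """Assumed form: each block is refined independently of the other blocks."""
    # memory: (M, K, d) holds K hypothetical prototypes for each of M factors
    n_slots, dim = slots.shape
    n_blocks, _, d_block = memory.shape
    blocks = slots.reshape(n_slots, n_blocks, d_block)         # (N, M, d)
    logits = np.einsum("nmd,mkd->nmk", blocks, memory) / np.sqrt(d_block)
    attn = softmax(logits, axis=2)                             # softmax over prototypes
    refined = np.einsum("nmk,mkd->nmd", attn, memory)          # (N, M, d)
    return refined.reshape(n_slots, dim)

def block_slot_encode(feats, slots, memory, n_iters=3):
    """Alternate the two binding steps for a fixed number of iterations."""
    for _ in range(n_iters):
        slots = spatial_binding(slots, feats)
        slots = factor_binding(slots, memory)
    return slots

def swap_block(slots, i, j, m, n_blocks):
    """Exchange block m between slots i and j to form a new factor combination."""
    blocks = slots.reshape(slots.shape[0], n_blocks, -1).copy()
    blocks[[i, j], m] = blocks[[j, i], m]
    return blocks.reshape(slots.shape)

# Toy sizes (arbitrary): 16 locations, 3 slots, 4 blocks of width 2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
init_slots = rng.normal(size=(3, 8))
memory = rng.normal(size=(4, 5, 2))

out = block_slot_encode(feats, init_slots, memory)
swapped = swap_block(out, i=0, j=1, m=2, n_blocks=4)
print(out.shape, swapped.shape)
```

In this sketch, `swap_block` mimics the idea of composing unseen factor combinations: one factor of an object is replaced by the same factor from another object while all other blocks stay fixed. A decoder, which is not shown here, would be needed to render the resulting slots as a scene.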