Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem, that is, recovering the individual factors from a composed representation. One popular way to construct these representations is the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to decompose such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes through a reconstruction-driven guidance term that encourages the composition of the factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.
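To make the decomposition problem concrete, the following is a minimal sketch of the classic resonator-network baseline mentioned above, not of the diffusion-based method this abstract proposes. It assumes random bipolar codebooks, element-wise (Hadamard) multiplication as the binding operation, and a sign nonlinearity; the function names, vector dimension, and codebook sizes are illustrative choices and do not come from the paper.

```python
import numpy as np


def bipolar_sign(x):
    """Map values to {-1, +1}; ties at zero go to +1 (an arbitrary choice)."""
    return np.where(x >= 0, 1, -1)


def resonator_decompose(s, codebooks, n_iters=50):
    """Estimate which codebook entry from each factor was bound into s.

    s         : bound bipolar vector of shape (N,)
    codebooks : list of bipolar arrays, one per factor, each of shape (D, N)
    Returns one index per factor.
    """
    # Start each factor estimate as the superposition of all its candidates.
    estimates = [bipolar_sign(cb.sum(axis=0)) for cb in codebooks]

    for _ in range(n_iters):
        previous = [e.copy() for e in estimates]
        for f, cb in enumerate(codebooks):
            # Unbind the current estimates of all other factors from s.
            # For bipolar vectors, Hadamard binding is its own inverse.
            others = np.ones_like(s)
            for g, e in enumerate(estimates):
                if g != f:
                    others = others * e
            unbound = s * others
            # Project onto the codebook (a Hopfield-style cleanup step).
            similarities = cb @ unbound
            estimates[f] = bipolar_sign(cb.T @ similarities)
        # Stop once no estimate changed during a full sweep.
        if all(np.array_equal(p, e) for p, e in zip(previous, estimates)):
            break

    return [int(np.argmax(cb @ e)) for cb, e in zip(codebooks, estimates)]


# Usage: bind one entry from each of three random codebooks, then decompose.
rng = np.random.default_rng(0)
codebooks = [rng.choice([-1, 1], size=(8, 2048)) for _ in range(3)]
true_idx = [3, 5, 1]
s = codebooks[0][3] * codebooks[1][5] * codebooks[2][1]
recovered = resonator_decompose(s, codebooks)
```

Each factor's update depends on the current estimates of the other factors, which is the coupling that the reconstruction-driven guidance term plays the role of in the diffusion setting. In this small, noise-free configuration the iteration would typically be expected to settle on the bound indices within a few sweeps, though convergence is not guaranteed in general.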