We describe a first step towards learning general-purpose visual representations of physical scenes using only image prediction as a training criterion. To do so, we first define "physical scene" and show that, even though different agents may maintain different representations of the same scene, the underlying physical scene that can be inferred is unique. We then show that NeRFs cannot represent the physical scene, as they lack extrapolation mechanisms. Such mechanisms could, at least in theory, be provided by Diffusion Models. To test this hypothesis empirically, NeRFs can be combined with Diffusion Models, a process we refer to as NeRF Diffusion, and the resulting model used as an unsupervised representation of the physical scene. Our analysis is limited to visual data, without the external grounding mechanisms that independent sensory modalities can provide.