Human 3D vision involves two distinct stages: an Experience Module, where stereo depth is extracted relative to fixation, and an Inference Module, where this experience is interpreted to estimate 3D scene properties. Paradoxically, although stereo vision does not provide us with absolute distance information, it nonetheless affects our inferences about distance. We propose the Inference Module exploits a natural scene statistic: near scenes produce vivid disparity gradients, while far scenes appear comparatively flat. QualiaNet implements this two-stage architecture computationally: disparity maps simulating human stereo experience are passed to a CNN trained to estimate distance. The network can recover distance from disparity gradients alone, validating this approach.
翻译:人类三维视觉涉及两个不同阶段:经验模块(从注视点提取立体深度信息)与推理模块(对经验进行解释以估算三维场景属性)。矛盾的是,尽管立体视觉无法提供绝对距离信息,却仍影响着我们对距离的推断。我们提出推理模块利用了一项自然场景统计特性:近处场景产生鲜明视差梯度,而远处场景相对平坦。QualiaNet通过计算方式实现该两阶段架构:模拟人类立体视觉经验的视差图被输入至经训练的距离估算卷积神经网络。该网络仅凭视差梯度即可恢复距离信息,验证了该方法的有效性。