A 3D visual illusion is a perceptual phenomenon in which a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, so that a flat artwork or object appears three-dimensional to the human visual system. In this paper, we show that machine visual systems are also seriously fooled by 3D visual illusions, in both monocular and binocular depth estimation. To explore and analyze the impact of 3D visual illusions on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images, which we use to train and evaluate state-of-the-art (SOTA) monocular and binocular depth estimation methods. We also propose a depth estimation framework for 3D visual illusions that uses common-sense knowledge from a vision-language model to adaptively fuse depth from binocular disparity with monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
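As a rough illustration of what adaptively fusing depth from binocular disparity with monocular depth could look like, the sketch below converts a disparity map to depth with the standard rectified-stereo relation (depth = focal length × baseline / disparity) and then blends it per pixel with a monocular depth map. This is not the framework proposed in the paper: the function names, the `mono_weight` map (a hypothetical stand-in for whatever per-pixel reliability signal a vision-language model might supply), and the convex-blend rule are all assumptions made for illustration only.

```python
import numpy as np


def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Rectified-stereo relation: depth = focal * baseline / disparity.

    Disparities are clamped to `eps` to avoid division by zero.
    """
    return focal_px * baseline_m / np.maximum(disparity, eps)


def fuse_depth(stereo_depth, mono_depth, mono_weight):
    """Per-pixel convex blend of two depth maps (illustrative assumption).

    mono_weight is a hypothetical map in [0, 1]: 1 means the monocular
    estimate is fully trusted at that pixel, 0 means the stereo estimate is.
    """
    w = np.clip(mono_weight, 0.0, 1.0)
    return w * mono_depth + (1.0 - w) * stereo_depth


# Toy 1x2 example: trust stereo at the first pixel, monocular at the second.
stereo = disparity_to_depth(np.array([[10.0, 20.0]]), focal_px=500.0, baseline_m=0.1)
mono = np.array([[4.0, 4.0]])
weight = np.array([[0.0, 1.0]])
fused = fuse_depth(stereo, mono, weight)
```

In this toy setup the weight map is binary, so the fused map simply selects one source per pixel; intermediate weights would interpolate between the two estimates.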