Panoramic images enable a deeper understanding and more holistic perception of the $360^\circ$ surrounding environment, and they naturally encode richer scene context than standard perspective images. Previous work has largely tackled scene understanding in a bottom-up manner, processing each sub-task separately and exploring few of the correlations among them. In this paper, we propose a novel method that uses a depth prior for holistic indoor scene understanding, simultaneously recovering object shapes, oriented bounding boxes, and the 3D room layout from a single panorama. To fully exploit the rich context information, we design a transformer-based context module that predicts the representation of each component of the scene and the relationships among them. In addition, we introduce a real-world dataset for scene understanding that includes photo-realistic panoramas, high-fidelity depth images, accurately annotated room layouts, and oriented object bounding boxes and shapes. Experiments on synthetic and real-world datasets demonstrate that our method outperforms previous panoramic scene understanding methods in both layout estimation and 3D object detection.