TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances

Function and affordance are critical to 3D scene understanding and support task-oriented objectives. In this work, we develop a model that learns to structure functional affordance across a 3D hierarchical scene graph (3DHSG) representing the spatial organization of a scene, so that affordance varies with the spatial context of the graph. Starting from segmented object point clouds and object semantic labels, our algorithm learns to construct a 3DHSG with a top node that identifies the room label, child nodes that define local spatial regions inside the room with region-specific affordances, and grandchild nodes that indicate object locations and object-specific affordances. To support this work, we create a custom 3DHSG dataset that provides ground-truth annotations of local spatial regions with region-specific affordances, together with object-specific affordances for each object. We employ a transformer-based model to learn the 3DHSG, trained in a multi-task learning framework that jointly learns room classification and the definition of spatial regions within the room with region-specific affordances. Our work improves on the performance of state-of-the-art baseline models and shows one approach for applying transformer models to 3D scene understanding and to the generation of 3DHSGs that capture the spatial organization of a room. The code and dataset are publicly available.
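To make the three-level structure concrete, the following is a minimal sketch of how a 3DHSG could be represented as a data structure: a room node, region nodes with region-specific affordances, and object nodes with object-specific affordances. It is not the released implementation. All class names, field names, the `build_scene_graph` helper, and the affordance labels are assumptions chosen for illustration, and the region assignments, which the model would predict, are simply passed in as inputs here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ObjectNode:
    """Grandchild node: one object, its location, and its affordance."""
    label: str                              # object semantic label
    centroid: Tuple[float, float, float]    # object location (hypothetical field)
    affordance: str                         # object-specific affordance


@dataclass
class RegionNode:
    """Child node: a local spatial region with a region-specific affordance."""
    affordance: str
    objects: List[ObjectNode] = field(default_factory=list)


@dataclass
class RoomNode:
    """Top node: the room label and its regions."""
    label: str
    regions: List[RegionNode] = field(default_factory=list)

    def all_objects(self) -> List[ObjectNode]:
        # Flatten the hierarchy back into a list of objects.
        return [obj for region in self.regions for obj in region.objects]


def build_scene_graph(
    room_label: str,
    objects: List[ObjectNode],
    region_ids: List[int],
    region_affordances: Dict[int, str],
) -> RoomNode:
    """Group objects into regions given one region id per object.

    In the paper these assignments are learned; here they are assumed given.
    """
    if len(objects) != len(region_ids):
        raise ValueError("expected one region id per object")
    regions: Dict[int, RegionNode] = {}
    for obj, rid in zip(objects, region_ids):
        if rid not in regions:
            regions[rid] = RegionNode(affordance=region_affordances[rid])
        regions[rid].objects.append(obj)
    # Order regions by id so the output is deterministic.
    return RoomNode(label=room_label, regions=[regions[r] for r in sorted(regions)])


# Illustrative input; labels and affordances are made up for this example.
objects = [
    ObjectNode("bed", (0.0, 0.0, 0.4), "lie"),
    ObjectNode("pillow", (0.0, 0.5, 0.6), "rest"),
    ObjectNode("desk", (2.0, 1.0, 0.7), "place"),
    ObjectNode("chair", (2.0, 0.4, 0.5), "sit"),
]
graph = build_scene_graph(
    room_label="bedroom",
    objects=objects,
    region_ids=[0, 0, 1, 1],
    region_affordances={0: "sleeping", 1: "working"},
)
```

Keeping the affordance on both the region and the object nodes mirrors the idea that the same object can contribute to different functions depending on the region it belongs to.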