In this paper, we revisit Semantic Scene Completion (SSC), a useful task that predicts the semantic and occupancy representation of 3D scenes. Most existing methods for this task rely on voxelized scene representations to preserve local scene structure. However, because of the visible empty voxels in such representations, these methods incur heavy computational redundancy as the network grows deeper, which limits completion quality. To address this dilemma, we propose a novel point-voxel aggregation network. First, we convert the voxelized scene to a point cloud by removing the visible empty voxels, and adopt a deep point stream to capture semantic information from the scene efficiently. Meanwhile, a lightweight voxel stream containing only two 3D convolution layers preserves the local structure of the voxelized scene. Furthermore, we design an anisotropic voxel aggregation operator to fuse structural details from the voxel stream into the point stream, and a semantic-aware propagation module that uses semantic labels to enhance the up-sampling process in the point stream. We demonstrate that our model surpasses state-of-the-art methods on two benchmarks by a large margin, using only depth images as input.
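The voxel-to-point conversion described above can be illustrated with a minimal sketch. It is not the implementation from the paper: the three-state voxel encoding (`VISIBLE_EMPTY`, `SURFACE`, `OCCLUDED`), the function name `voxels_to_points`, and the `voxel_size` parameter are all assumptions made for illustration. The sketch only shows the idea of dropping visible empty voxels and keeping the remaining voxel centers as points, together with their flat grid indices so that per-point predictions could later be written back into the voxel grid.

```python
import numpy as np

# Hypothetical voxel states; the abstract does not specify an encoding.
VISIBLE_EMPTY, SURFACE, OCCLUDED = 0, 1, 2


def voxels_to_points(state, voxel_size=1.0):
    """Drop visible empty voxels and return the rest as a point cloud.

    Returns the (N, 3) voxel centers and the (N,) flat indices of the
    kept voxels in the original grid.
    """
    state = np.asarray(state)
    keep = state != VISIBLE_EMPTY              # boolean mask over the grid
    idx = np.argwhere(keep)                    # (N, 3) integer grid coordinates
    points = (idx + 0.5) * voxel_size          # centers of the kept voxels
    flat = np.ravel_multi_index(tuple(idx.T), state.shape)
    return points, flat


# Toy 4x4x4 scene: mostly visible empty space, one surface voxel,
# and a slab of occluded voxels.
grid = np.full((4, 4, 4), VISIBLE_EMPTY)
grid[1, 2, 3] = SURFACE
grid[2:, :, 0] = OCCLUDED

points, flat = voxels_to_points(grid, voxel_size=0.02)
print(points.shape)  # 9 of the 64 voxels are kept
```

In this toy scene only 9 of 64 voxels survive, which is the kind of reduction that would let a deeper point stream run without spending computation on visible empty space.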