Indoor 3D object detection is an essential task in single-image scene understanding and underpins spatial cognition in visual reasoning. Existing methods for 3D object detection from a single image either predict each object independently or reason implicitly over all possible objects, and so fail to exploit the relational geometric information between objects. To address this problem, we propose Explicit3D, a dynamic sparse graph pipeline built on object geometry and semantic features. For efficiency, we define a relatedness score and design a novel dynamic pruning algorithm, followed by a cluster sampling method, to generate and update a sparse scene graph. Furthermore, Explicit3D introduces homogeneous matrices and defines new relative and corner losses to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, the relative and corner losses are derived from the homogeneous transformation, which drives the model to learn the geometric consistency between objects. Experimental results on the SUN RGB-D dataset demonstrate that Explicit3D achieves a better performance balance than the state of the art.
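To make the idea of a pairwise loss derived from homogeneous transformations concrete, the following is a minimal sketch and not the formulation used by Explicit3D, which the abstract does not specify. It assumes each object pose is a rotation about the vertical axis plus a centre, and it compares the predicted transform between two objects with the transform between their ground-truth poses. All function names (`pose_matrix`, `relative_transform`, `relative_loss`) and the L1 form of the penalty are hypothetical choices made for illustration.

```python
import math


def pose_matrix(yaw, center):
    """4x4 homogeneous transform: rotation `yaw` about the vertical (y) axis, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = center
    return [
        [c, 0.0, s, x],
        [0.0, 1.0, 0.0, y],
        [-s, 0.0, c, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Inverse of a rigid transform: [R t; 0 1]^-1 = [R^T, -R^T t; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_transform(t_i, t_j):
    """Pose of object j expressed in the local frame of object i."""
    return matmul(invert_rigid(t_i), t_j)


def relative_loss(pred_i, pred_j, gt_i, gt_j):
    """Assumed L1 penalty between predicted and ground-truth relative transforms.

    Only the top three rows are compared; the last row is always [0, 0, 0, 1].
    """
    rel_pred = relative_transform(pred_i, pred_j)
    rel_gt = relative_transform(gt_i, gt_j)
    diffs = [abs(rel_pred[r][c] - rel_gt[r][c])
             for r in range(3) for c in range(4)]
    return sum(diffs) / len(diffs)
```

In this sketch, shifting both predicted objects by the same offset leaves the loss at zero, while moving only one of them does not. The term therefore constrains the geometry between a pair of objects rather than either absolute pose, which is the kind of pairwise consistency the abstract describes.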