Indoor 3D object detection is an essential task in single-image scene understanding and underpins spatial cognition in visual reasoning. Existing methods for 3D object detection from a single image either predict each object independently or reason implicitly over all possible objects, and therefore fail to exploit the relational geometric information between objects. To address this problem, we propose Explicit3D, a dynamic sparse graph pipeline based on object geometry and semantic features. For efficiency, we further define a relatedness score and design a novel dynamic pruning algorithm, followed by a cluster sampling method, for sparse scene graph generation and updating. Furthermore, Explicit3D introduces homogeneous matrices and defines new relative and corner losses to explicitly model the spatial difference between target pairs. Instead of using ground-truth labels as direct supervision, the relative and corner losses are derived from the homogeneous transformation, which enables the model to learn geometric consistency between objects. Experimental results on the SUN RGB-D dataset demonstrate that Explicit3D achieves a better performance balance than the state of the art.
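To make the idea of supervising object pairs through a homogeneous transformation concrete, the following is a minimal illustrative sketch and not the Explicit3D implementation. It assumes each object pose is a 4x4 homogeneous matrix with rotation only about the vertical axis, and it assumes an L1 penalty between the predicted and reference relative transforms. The function names (`pose_matrix`, `relative_transform`, `relative_loss`) are hypothetical and do not come from the paper.

```python
import math


def pose_matrix(center, yaw):
    """4x4 homogeneous pose: rotation by `yaw` about the z axis, then translation.

    Assumption: yaw-only rotation, chosen here for brevity.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = center
    return [
        [c, -s, 0.0, x],
        [s, c, 0.0, y],
        [0.0, 0.0, 1.0, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Inverse of a rigid transform: [R p; 0 1]^-1 = [R^T, -R^T p; 0 1]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[i] + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_transform(t_a, t_b):
    """Pose of object b expressed in the coordinate frame of object a."""
    return matmul(invert_rigid(t_a), t_b)


def relative_loss(pred_a, pred_b, ref_a, ref_b):
    """Mean absolute difference between predicted and reference relative transforms.

    Assumption: an L1 penalty over the top three rows of the 4x4 matrix.
    """
    rel_pred = relative_transform(pred_a, pred_b)
    rel_ref = relative_transform(ref_a, ref_b)
    total = sum(abs(rel_pred[i][j] - rel_ref[i][j])
                for i in range(3) for j in range(4))
    return total / 12.0
```

A property worth noting in this sketch is that the relative transform is unchanged when both objects are moved by the same rigid motion, so such a loss penalizes only inconsistencies between the two objects rather than their absolute placement in the scene.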