Indoor 3D object detection is an essential task in single-image scene understanding and underpins spatial cognition in visual reasoning. Existing works on 3D object detection from a single image either predict each object independently or reason implicitly over all possible objects, and so fail to harness the relational geometric information between objects. To address this problem, we propose Explicit3D, a dynamic sparse graph pipeline based on object geometry and semantic features. For efficiency, we further define a relatedness score and design a novel dynamic pruning algorithm, followed by a cluster sampling method, for sparse scene graph generation and updating. Furthermore, Explicit3D introduces homogeneous matrices and defines new relative and corner losses to model the spatial difference between target pairs explicitly. Instead of using ground-truth labels as direct supervision, our relative and corner losses are derived from the homogeneous transformation, which encourages the model to learn the geometric consistency between objects. Experimental results on the SUN RGB-D dataset demonstrate that Explicit3D achieves a better performance balance than state-of-the-art methods.
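To make the idea of pairwise supervision through homogeneous transformations concrete, the following is a minimal sketch and not the formulation used in Explicit3D, which the abstract does not specify. It assumes each object pose is a rigid transform given by a box center and a yaw angle about the vertical axis, and it compares the predicted relative transform between two objects with the relative transform computed from reference poses. All function names (`pose_matrix`, `relative_transform`, `relative_loss`, `corner_loss`) and the choice of L1 and Euclidean distances are illustrative assumptions.

```python
import math

def pose_matrix(center, yaw):
    """4x4 homogeneous pose: rotation by `yaw` about z, then translation to `center`.
    Assumption: boxes rotate only about the vertical axis."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = center
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Inverse of a rigid transform [R | p]: [R^T | -R^T p]."""
    rt = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [-sum(rt[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rt[0] + [p[0]], rt[1] + [p[1]], rt[2] + [p[2]],
            [0.0, 0.0, 0.0, 1.0]]

def relative_transform(t_i, t_j):
    """Pose of object j expressed in the frame of object i."""
    return matmul(invert_rigid(t_i), t_j)

def box_corners(size):
    """Eight corners of a box centered at the origin, in homogeneous coordinates."""
    dx, dy, dz = (s / 2.0 for s in size)
    return [[sx * dx, sy * dy, sz * dz, 1.0]
            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]

def transform_points(t, points):
    """Apply a 4x4 transform to homogeneous points; return 3D coordinates."""
    return [[sum(t[i][k] * p[k] for k in range(4)) for i in range(3)]
            for p in points]

def relative_loss(pred_i, pred_j, ref_i, ref_j):
    """Hypothetical relative loss: mean absolute difference between the
    predicted and reference relative transforms (top 3x4 block)."""
    rp = relative_transform(pred_i, pred_j)
    rr = relative_transform(ref_i, ref_j)
    return sum(abs(rp[r][c] - rr[r][c])
               for r in range(3) for c in range(4)) / 12.0

def corner_loss(pred_i, pred_j, ref_i, ref_j, size_j):
    """Hypothetical corner loss: mean Euclidean distance between the corners
    of object j placed by the predicted vs. reference relative transform."""
    corners = box_corners(size_j)
    cp = transform_points(relative_transform(pred_i, pred_j), corners)
    cr = transform_points(relative_transform(ref_i, ref_j), corners)
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
             for p, q in zip(cp, cr)]
    return sum(dists) / len(dists)
```

Under these assumptions, both losses depend only on how the two objects are placed relative to each other: shifting both predicted poses by the same global translation leaves the losses unchanged, while displacing only one object of the pair is penalized. This is one way a pairwise term can encourage geometric consistency between objects rather than per-object accuracy alone.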