Reconstructing compositional 3D representations of scenes, where each object is represented with its own 3D model, is a highly desirable capability in robotics and augmented reality. However, most existing methods rely heavily on strong appearance priors for object discovery, and therefore work only on the object classes they were trained on, or they do not allow for object manipulation, which is necessary to scan objects fully and to guide object discovery in challenging scenarios. We address these limitations with a novel interaction-guided and class-agnostic method based on object displacements: a user moves around a scene with an RGB-D camera and holds up objects, and the method outputs one 3D model per held-up object. Our main contribution to this end is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. On a custom-captured dataset, our pipeline discovers manipulated objects with 78.3% precision at 100% recall and reconstructs them with a mean chamfer distance of 0.90 cm. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a reduction in chamfer distance of 73% while detecting 99% fewer false positives.
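The abstract reports reconstruction quality as a mean chamfer distance. As a point of reference, a minimal sketch of the commonly used symmetric variant of this metric is shown below; the abstract does not specify the exact formulation, so the averaging scheme and the function name `chamfer_distance` are assumptions for illustration.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric chamfer distance between point clouds a (N,3) and b (M,3):
    the mean nearest-neighbor distance in each direction, averaged.
    (One common convention; the paper may use a different variant.)"""
    # Pairwise Euclidean distances via broadcasting -> shape (N, M).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # For each point, distance to its nearest neighbor in the other cloud.
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())
```

With this convention, identical clouds score 0, and a single point shifted by 1 cm against its original yields a chamfer distance of 1 cm.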