3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Serving these applications requires a multifaceted approach that covers scene-centric, object-centric, and interaction-centric capabilities. While numerous datasets address the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered by current work. In this work, we address this shortcoming and introduce (1) an expertly curated dataset in the Universal Scene Description (USD) format, featuring high-quality manual annotations, including instance segmentation and articulation, on 280 indoor scenes; (2) a learning-based model, together with a novel baseline, capable of predicting part segmentation along with a full specification of motion attributes, including motion type, articulated and interactable parts, and motion parameters; (3) a benchmark for comparing upcoming methods on this task. In total, our dataset provides eight types of annotations: object and part segmentations, motion types, movable and interactable parts, motion parameters, connectivity, and object mass. With its broad, high-quality annotations, the dataset provides the basis for holistic 3D scene understanding models. All data is provided in the USD format, enabling interoperability and easy integration with downstream tasks. We provide open access to our dataset, benchmark, and method's source code.