Current approaches to 3D scene graph prediction rely on labeled datasets to train models for a fixed set of known object classes and relationship categories. We present Open3DSG, an alternative approach that learns 3D scene graph prediction in an open world without requiring labeled scene graph data. We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open-world 2D vision-language foundation models. This enables us to predict 3D scene graphs from 3D point clouds in a zero-shot manner by querying object classes from an open vocabulary and predicting the inter-object relationships with a grounded LLM that takes scene graph features and the queried object classes as context. Open3DSG is the first 3D point cloud method to predict not only explicit open-vocabulary object classes but also open-set relationships that are not limited to a predefined label set, making it possible to express rare as well as specific objects and relationships in the predicted 3D scene graph. Our experiments show that Open3DSG is effective at predicting arbitrary object classes as well as complex inter-object relationships, including spatial, supportive, semantic, and comparative ones.
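As a rough illustration of the open-vocabulary query step, the sketch below assigns each graph node the class name whose text embedding is most cosine-similar to the node feature. It assumes node features and text embeddings already live in a shared space; the function names, the toy 3-dimensional vectors, and the candidate class list are hypothetical and are not taken from the Open3DSG implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def query_open_vocabulary(node_feats, text_feats, class_names):
    """Label each node with the most similar class name (hypothetical helper)."""
    labels = []
    for feat in node_feats:
        scores = [cosine(feat, text) for text in text_feats]
        labels.append(class_names[scores.index(max(scores))])
    return labels

# Toy stand-ins for text embeddings of user-chosen class names (assumed values).
class_names = ["chair", "table", "lamp"]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Toy stand-ins for co-embedded 3D node features (assumed values).
node_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 2.0]]

labels = query_open_vocabulary(node_feats, text_feats, class_names)
print(labels)
```

Because the class list is supplied only at query time, it can be swapped for any other vocabulary without retraining, which is the property the zero-shot setting relies on. Relationship prediction with a grounded LLM is not covered by this sketch.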