3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, the evaluation of these representations is currently limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymous object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate a range of existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases and avenues for improvement. Our experiments provide insights into feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.