Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond a closed set of object classes. However, existing approaches and benchmarks primarily focus on the open-vocabulary problem at the level of object classes, which is insufficient for a holistic evaluation of the extent to which a model understands a 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open-vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained, object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which covers 3D object attributes across eight representative linguistic aspects, including affordance, property, material, and more. We further evaluate state-of-the-art OV-3D methods on OpenScan and find that they struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed by simply scaling up the number of object classes during training. We highlight the limitations of existing methodologies and explore a promising direction for overcoming the identified shortcomings. Data and code are available at https://github.com/YoujunZhao/OpenScan