3D scene understanding is an important task, and there has been a recent surge of research interest in aligning 3D point-cloud representations with text to empower embodied AI. However, due to the lack of comprehensive 3D benchmarks, the capabilities of 3D models in real-world scenes, particularly in challenging scenes containing subtly distinguishable objects, remain insufficiently investigated. To enable a more thorough evaluation of 3D models, we propose ObjVariantEnsemble, a scheme that systematically introduces additional scenes with specified object classes, colors, shapes, quantities, and spatial relationships to meet model evaluation needs. More importantly, we intentionally construct scenes containing objects that are similar to a controlled degree, and we design an LLM-VLM-cooperated annotator to capture their key distinctions as annotations. The resulting benchmark can better challenge 3D models, reveal their shortcomings in understanding, and potentially aid the further development of 3D models.