3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both the dataset and the method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce spatial semantics. A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot methods using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.