A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the isolated use of individual scene representations and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different tasks to support multi-task training. Tested across ten diverse 3D-VL datasets, PQ3D demonstrates strong performance on these tasks, setting new records on most benchmarks. In particular, PQ3D improves the state-of-the-art on ScanNet200 by 4.9% (AP25), ScanRefer by 5.4% ([email protected]), Multi3DRefer by 11.7% ([email protected]), and Scan2Cap by 13.4% ([email protected]). Moreover, PQ3D supports flexible inference with individual or combined forms of available 3D representations, e.g., using voxel input alone.
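To make the promptable-query design concrete, the following is a minimal PyTorch sketch of how learnable queries might cross-attend to prompt tokens and to segment-level scene features (already fused from voxel, point-cloud, and multi-view image backbones in a shared 3D space) before feeding shared output heads. All module names, dimensions, and the exact layer ordering here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PromptableQueryDecoder(nn.Module):
    """Illustrative sketch (not the official PQ3D code): queries attend to
    prompt tokens and unified segment-level scene features, then feed
    universal output heads shared across tasks."""

    def __init__(self, d_model=256, num_queries=100, num_heads=8, num_layers=4):
        super().__init__()
        # Learnable instance queries, one embedding per candidate instance.
        self.queries = nn.Embedding(num_queries, d_model)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                # Cross-attention to the encoded task prompt (e.g., a referring sentence).
                "prompt_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                # Cross-attention to segment-level scene features in shared 3D space.
                "scene_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                # Self-attention among queries.
                "self_attn": nn.MultiheadAttention(d_model, num_heads, batch_first=True),
                "ffn": nn.Sequential(
                    nn.Linear(d_model, 4 * d_model),
                    nn.ReLU(),
                    nn.Linear(4 * d_model, d_model),
                ),
            })
            for _ in range(num_layers)
        ])
        # Hypothetical universal heads: a mask head (dot product with segment
        # features yields per-segment masks) and a grounding/relevance head.
        self.mask_head = nn.Linear(d_model, d_model)
        self.ground_head = nn.Linear(d_model, 1)

    def forward(self, scene_feats, prompt_feats):
        # scene_feats:  (B, S, d) segment-level features from unified backbones
        # prompt_feats: (B, T, d) encoded task prompt tokens
        B = scene_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        for layer in self.layers:
            q = q + layer["prompt_attn"](q, prompt_feats, prompt_feats)[0]
            q = q + layer["scene_attn"](q, scene_feats, scene_feats)[0]
            q = q + layer["self_attn"](q, q, q)[0]
            q = q + layer["ffn"](q)
        masks = torch.einsum("bqd,bsd->bqs", self.mask_head(q), scene_feats)
        scores = self.ground_head(q).squeeze(-1)
        # The refined queries q could additionally condition a text-generation
        # head for captioning or planning tasks.
        return masks, scores, q
```

Under these assumptions, the same decoder serves segmentation (via the mask logits), grounding (via the per-query scores), and generation tasks (via the refined query features), which is what allows a single set of output heads to support multi-task training.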