3D dense captioning requires a model to translate its understanding of an input 3D scene into several captions associated with different object regions. Existing methods adopt a sophisticated "detect-then-describe" pipeline, which builds explicit relation modules upon a 3D detector with numerous hand-crafted components. While these methods have achieved initial success, the cascaded pipeline tends to accumulate errors because of duplicated and inaccurate box estimations and cluttered 3D scenes. In this paper, we first propose Vote2Cap-DETR, a simple yet effective transformer framework that decouples the decoding processes of caption generation and object localization through parallel decoding. Moreover, we argue that object localization and description generation require different levels of scene understanding, which can be challenging for a shared set of queries to capture. To this end, we propose an advanced version, Vote2Cap-DETR++, which decouples the queries into localization and caption queries to capture task-specific features. Additionally, we introduce an iterative spatial refinement strategy for vote queries, yielding faster convergence and better localization performance. We also insert additional spatial information into the caption head for more accurate descriptions. Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that Vote2Cap-DETR and Vote2Cap-DETR++ surpass conventional "detect-then-describe" methods by a large margin. Code will be made available at https://github.com/ch3cook-fdu/Vote2Cap-DETR.