Point scene understanding is a challenging task that processes real-world scene point clouds, aiming to segment each object, estimate its pose, and reconstruct its mesh simultaneously. The recent state-of-the-art method first segments each object and then processes the objects independently, using multiple stages for the different sub-tasks. This leads to a pipeline that is complex to optimize and makes it hard to leverage the relationship constraints between multiple objects. In this work, we propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representations to facilitate learning with multiple objects for the multiple sub-tasks in a unified manner. Each object is represented as a query, and a Transformer decoder is adapted to iteratively optimize all the queries while accounting for their relationships. In particular, we introduce a semantic-geometry disentangled query (SGDQ) design that enables the query features to attend separately to the semantic information and the geometric information relevant to the corresponding sub-tasks. A hybrid bipartite matching module is employed to make good use of the supervision from all the sub-tasks during training. Qualitative and quantitative experimental results demonstrate that our method achieves state-of-the-art performance on the challenging ScanNet dataset. Code is available at https://github.com/SAITPublic/DOCTR.
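To illustrate the idea of matching object queries to ground-truth objects using supervision from several sub-tasks at once, the following is a minimal sketch, not the DOCTR implementation. The abstract does not specify the matching costs, so the cost terms (`seg`, `pose`), their weights, and the function names `hybrid_cost` and `match` are all assumptions made for illustration. The assignment is found by brute force, which is only practical for tiny examples; a real system would more likely use the Hungarian algorithm, for instance `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations


def hybrid_cost(cost_terms, weights):
    """Combine per-sub-task cost matrices into one matrix.

    cost_terms maps a (hypothetical) sub-task name to a matrix indexed as
    [query][ground_truth_object]; weights maps the same names to scalars.
    """
    names = list(cost_terms)
    n_q = len(cost_terms[names[0]])
    n_gt = len(cost_terms[names[0]][0])
    return [
        [sum(weights[k] * cost_terms[k][q][g] for k in names) for g in range(n_gt)]
        for q in range(n_q)
    ]


def match(cost):
    """Brute-force minimum-cost one-to-one matching.

    Assumes there are at least as many queries as ground-truth objects.
    Returns a list of (query, ground_truth) pairs and the total cost.
    """
    n_q, n_gt = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    # perm[g] is the query assigned to ground-truth object g.
    for perm in permutations(range(n_q), n_gt):
        total = sum(cost[q][g] for g, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return [(q, g) for g, q in enumerate(best_perm)], best_total


# Toy example: 3 queries, 2 ground-truth objects, made-up costs.
cost_terms = {
    "seg": [[0.9, 0.2], [0.1, 0.8], [0.5, 0.5]],   # assumed segmentation cost
    "pose": [[0.7, 0.3], [0.2, 0.9], [0.4, 0.1]],  # assumed pose cost
}
weights = {"seg": 1.0, "pose": 0.5}  # assumed weights

pairs, total = match(hybrid_cost(cost_terms, weights))
print(pairs, total)
```

In this toy case query 1 is matched to object 0 and query 0 to object 1, while query 2 stays unmatched. The point of summing the terms before matching is that a query is assigned to the object it fits best across all sub-tasks together, rather than by segmentation quality alone.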