AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation

In interactive segmentation, a model and a user work together to delineate objects of interest in a 3D point cloud. In an iterative process, the model assigns each data point to an object (or the background), while the user corrects errors in the resulting segmentation and feeds them back into the model. From a machine learning perspective, the goal is to design the model and the feedback mechanism in a way that minimizes the required user input. The current best practice segments objects one at a time and asks the user to provide positive clicks to indicate regions wrongly assigned to the background and negative clicks to indicate regions wrongly assigned to the object (foreground). Sequentially visiting objects is wasteful, since it disregards synergies between objects: a positive click for a given object can, by definition, serve as a negative click for nearby objects. Moreover, direct competition between adjacent objects can speed up the identification of their common boundary. We introduce AGILE3D, an efficient, attention-based model that (1) supports simultaneous segmentation of multiple 3D objects, (2) yields more accurate segmentation masks with fewer user clicks, and (3) offers faster inference. We encode the point cloud into a latent feature representation, treat user clicks as queries, and employ cross-attention to represent contextual relations between different click locations as well as between clicks and the 3D point cloud features. Every time new clicks are added, we only need to run a lightweight decoder that produces updated segmentation masks. In experiments with four different point cloud datasets, AGILE3D sets a new state of the art. We also verify its practicality in real-world setups with a real user study.

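The click-as-query idea from the abstract can be illustrated with a small toy sketch. This is not the AGILE3D architecture: it has no learned weights, it omits click-to-click attention, and the hand-written `point_feats` stand in for the output of a point cloud encoder. The names `click_query` and `decode_masks`, the convention that object id 0 means background, and the cosine-style normalization of queries are all assumptions made for illustration. The sketch shows only two things: each click becomes a query that cross-attends to fixed point features, and every point is assigned to the object whose click query matches it best, so a click for one object implicitly competes against all others.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def click_query(click_feat, point_feats):
    """Build a query from one click via a single cross-attention step (no learned weights)."""
    scale = math.sqrt(len(click_feat))
    weights = softmax([dot(click_feat, f) / scale for f in point_feats])
    context = [sum(w * f[d] for w, f in zip(weights, point_feats))
               for d in range(len(click_feat))]
    query = [q + c for q, c in zip(click_feat, context)]  # residual update
    norm = math.sqrt(dot(query, query))
    return [q / norm for q in query]  # unit length, so no query dominates by magnitude

def decode_masks(point_feats, clicks):
    """clicks: list of (point_index, object_id); object_id 0 is background (assumed convention)."""
    queries = [(obj, click_query(point_feats[idx], point_feats)) for idx, obj in clicks]
    object_ids = sorted({obj for obj, _ in queries})
    labels = []
    for f in point_feats:
        # An object's score at a point is that of its best-matching click.
        scores = {o: max(dot(q, f) for obj, q in queries if obj == o) for o in object_ids}
        # Objects compete directly: each point goes to the highest-scoring object.
        labels.append(max(object_ids, key=scores.get))
    return labels

# Stand-in for encoder output, computed once and reused across click rounds.
point_feats = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],   # resemble object 1
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],   # resemble object 2
    [0.0, 0.0, 1.0],                    # resembles background
    [0.55, 0.45, 0.0],                  # ambiguous point near the boundary
]

clicks = [(0, 1), (2, 2), (4, 0)]
first_pass = decode_masks(point_feats, clicks)              # [1, 1, 2, 2, 0, 1]

# The user corrects the ambiguous point: one click for object 2.
second_pass = decode_masks(point_feats, clicks + [(5, 2)])  # [1, 1, 2, 2, 0, 2]
```

In the second pass only `decode_masks` is re-run while `point_feats` stays untouched, which mirrors the abstract's point that new clicks require only the decoder. The single click for object 2 also removes the point from object 1 without a separate negative click.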