3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description. With applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation for tackling 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interactions, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that aims to disambiguate inter-instance relational cues; next, we construct a contrastive training scheme to induce separation in the latent space; we then resolve view-dependent utterances via a learned global camera token; and finally, we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and has won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge.
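To make the contrastive training idea concrete, the sketch below shows one plausible InfoNCE-style objective over a shared verbo-visual latent space: the embedding of the referred instance is pulled toward the sentence embedding while all other candidate instances in the scene, including same-class distractors, are pushed away. This is a minimal illustrative sketch, not the authors' implementation; the function name, tensor shapes, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(instance_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               target_idx: int,
                               temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical contrastive objective for dense grounding.

    instance_emb: (K, D) embeddings of K candidate instances in the scene
    text_emb:     (D,)   embedding of the referral description
    target_idx:   index of the ground-truth referred instance
    """
    # Cosine similarity between the description and every candidate instance
    sims = F.cosine_similarity(instance_emb, text_emb.unsqueeze(0), dim=-1)
    logits = sims / temperature  # (K,)
    # Cross-entropy over candidates: the referred instance is the positive,
    # so same-class distractors are explicitly pushed apart in latent space
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([target_idx]))

# Illustrative usage with random embeddings (shapes are assumptions)
emb = torch.randn(8, 128)   # 8 candidate instances in the scene
txt = torch.randn(128)      # encoded referral description
loss = contrastive_grounding_loss(emb, txt, target_idx=3)
```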