3D visual grounding is the task of localizing the object in a 3D scene that a natural-language description refers to. With applications ranging from autonomous indoor robotics to AR/VR, the task has recently risen in popularity. A common formulation for 3D visual grounding is grounding-by-detection, where localization is done via bounding boxes. However, for real-life applications that require physical interaction, a bounding box insufficiently describes the geometry of an object. We therefore tackle the problem of dense 3D visual grounding, i.e. referral-based 3D instance segmentation. We propose a dense 3D grounding network, ConcreteNet, featuring four novel stand-alone modules that aim to improve grounding performance for challenging repetitive instances, i.e. instances with distractors of the same semantic class. First, we introduce a bottom-up attentive fusion module that disambiguates inter-instance relational cues; next, we construct a contrastive training scheme to induce separation in the latent space; we then resolve view-dependent utterances via a learned global camera token; and finally we employ multi-view ensembling to improve referred mask quality. ConcreteNet ranks 1st on the challenging ScanRefer online benchmark and won the ICCV 3rd Workshop on Language for 3D Scenes "3D Object Localization" challenge. Our code is available at ouenal.github.io/concretenet/.
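To make the contrastive idea concrete, the sketch below shows a generic InfoNCE-style grounding loss: the utterance embedding is pulled toward the referred instance's embedding while same-class distractors act as negatives. Function and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_grounding_loss(text_emb, instance_embs, referred_idx, temperature=0.1):
    """Sketch of a contrastive separation objective (assumed formulation).

    Softmax over utterance-instance similarities; the negative log-likelihood
    of the referred instance pushes distractor embeddings apart in latent space.
    """
    sims = np.array([cosine(text_emb, e) for e in instance_embs]) / temperature
    logits = sims - sims.max()                     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over candidates
    return -float(np.log(probs[referred_idx]))     # NLL of the referred instance
```

Minimizing this loss drives the referred instance's embedding to dominate the softmax, inducing the latent-space separation between an instance and its same-class distractors that the abstract describes.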