Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogue and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, models should be data-efficient, generalize to different data distributions and to tasks with unseen semantic forms, and ground complex language semantics (e.g., viewpoint anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. Its modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
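To make the idea of hierarchical programs over functional modules concrete, the following is a minimal toy sketch and not the NS3D implementation. It assumes a hypothetical nested-tuple program syntax, a hypothetical scene of 2D object centers, and hand-written scoring functions standing in for the neural modules; the names `filter`, `and`, and `between` are illustrative only. It executes a program for an utterance like "the chair between the table and the door", where `between` is a relation among three objects.

```python
import math

# Hypothetical toy scene: each object has a category and a 2D center.
SCENE = [
    {"name": "chair_a", "category": "chair", "pos": (1.0, 0.0)},
    {"name": "chair_b", "category": "chair", "pos": (5.0, 0.0)},
    {"name": "table", "category": "table", "pos": (0.0, 0.0)},
    {"name": "door", "category": "door", "pos": (2.0, 0.0)},
]


def filter_category(scene, category):
    """Per-object score; a hand-written stand-in for a learned classifier."""
    return [1.0 if obj["category"] == category else 0.0 for obj in scene]


def between_score(p, a, b):
    """Ternary score: high when p is close to the midpoint of a and b."""
    mid = ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)
    return math.exp(-math.dist(p, mid))


def relate_ternary(scene, score_fn, anchors_1, anchors_2):
    """Score each object against every pair of distinct anchor objects."""
    out = []
    for obj in scene:
        best = 0.0
        for a, weight_a in zip(scene, anchors_1):
            for b, weight_b in zip(scene, anchors_2):
                if a is obj or b is obj or a is b:
                    continue
                score = score_fn(obj["pos"], a["pos"], b["pos"])
                best = max(best, weight_a * weight_b * score)
        out.append(best)
    return out


def execute(program, scene):
    """Recursively evaluate a nested-tuple program into per-object scores."""
    op = program[0]
    if op == "filter":
        return filter_category(scene, program[1])
    if op == "and":
        left = execute(program[1], scene)
        right = execute(program[2], scene)
        return [min(l, r) for l, r in zip(left, right)]
    if op == "between":
        first = execute(program[1], scene)
        second = execute(program[2], scene)
        return relate_ternary(scene, between_score, first, second)
    raise ValueError(f"unknown operation: {op}")


def ground(program, scene):
    """Return the name of the highest-scoring object."""
    scores = execute(program, scene)
    best_index = max(range(len(scene)), key=scores.__getitem__)
    return scene[best_index]["name"]


# "the chair between the table and the door" (hypothetical program form)
PROGRAM = (
    "and",
    ("filter", "chair"),
    ("between", ("filter", "table"), ("filter", "door")),
)

print(ground(PROGRAM, SCENE))
```

In this sketch each module returns one score per object, so modules compose by nesting. A learned system would replace the hand-written scoring functions with neural networks over 3D object features.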