3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting, which learns from only 3D scene and QA pairs and in which prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: first, language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; second, we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.