Existing work on object-level language grounding with 3D objects mostly focuses on improving performance by using off-the-shelf pre-trained models to capture features, such as viewpoint selection or geometric priors. However, it has not explored cross-modal representations for language-vision alignment in a cross-domain setting. To address this gap, we propose a novel method called Domain Adaptation for Language Grounding (DA4LG) with 3D objects. Specifically, DA4LG consists of a visual adapter module with multi-task learning, which achieves vision-language alignment through comprehensive multimodal feature representation. Experimental results demonstrate that DA4LG performs competitively across visual and non-visual language descriptions, independent of the completeness of observation. DA4LG achieves state-of-the-art performance on the language grounding benchmark SNARE, with accuracies of 83.8% and 86.8% in the single-view and multi-view settings, respectively. Simulation experiments show that DA4LG is more practical and generalizes better than existing methods. Our project is available at https://sites.google.com/view/da4lg.