Embodied Reference Understanding (ERU) requires identifying a target object in a visual scene based on both language instructions and pointing cues. Although prior work has made progress in open-vocabulary object detection, it often fails in ambiguous scenarios where the scene contains multiple candidate objects. To address this challenge, we propose a novel ERU framework that jointly leverages LLM-based data augmentation, a depth-map modality, and a depth-aware decision module. This design enables robust integration of linguistic and embodied cues, improving disambiguation in complex or cluttered environments. Experimental results on two datasets demonstrate that our approach significantly outperforms existing baselines, achieving more accurate and reliable referent detection.