We address the problem of Embodied Reference Understanding, which involves predicting the object that a person in the scene is referring to through both a pointing gesture and language. Accurately identifying the referent requires multimodal understanding: integrating textual instructions, visual pointing, and scene context. However, existing methods often struggle to effectively leverage visual cues for disambiguation. We also observe that, while the referent is often aligned with the head-to-fingertip line, it occasionally aligns more closely with the wrist-to-fingertip line. Relying on a single-line assumption is therefore overly simplistic and can lead to suboptimal performance. To address this, we propose a dual-model framework in which one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We further introduce a Gaussian ray heatmap representation of these lines and use it as input to provide a strong supervisory signal that encourages the model to better attend to pointing cues. To combine the strengths of both models, we present the CLIP-Aware Pointing Ensemble module, which performs a hybrid ensemble based on CLIP features. Additionally, we propose an object center prediction head as an auxiliary task to further enhance referent localization. We validate our approach through extensive experiments and analysis on the YouRefIt benchmark dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold, and further evaluate it on the CAESAR and ISL Pointing datasets.
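To make the ray representation concrete, the following is a minimal sketch of one plausible way to rasterize a Gaussian ray heatmap from an origin keypoint (head or wrist) through the fingertip; the function name, pixel coordinate convention, and the width parameter `sigma` are illustrative assumptions, not details specified in the text above.

```python
# Minimal sketch (assumed implementation): a heatmap in [0, 1] that peaks
# along the half-line from an origin keypoint through the fingertip and
# decays with a Gaussian falloff perpendicular to it.
import numpy as np

def gaussian_ray_heatmap(origin, fingertip, h, w, sigma=8.0):
    """origin, fingertip: (x, y) pixel coordinates; sigma: assumed ray width."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    d = np.array(fingertip, dtype=np.float32) - np.array(origin, dtype=np.float32)
    d /= np.linalg.norm(d) + 1e-6            # unit direction of the pointing ray
    px = xs - origin[0]
    py = ys - origin[1]
    t = px * d[0] + py * d[1]                # scalar projection onto the ray
    t = np.clip(t, 0.0, None)                # keep only the forward half-line
    # squared perpendicular distance from each pixel to the clipped ray
    dist2 = (px - t * d[0]) ** 2 + (py - t * d[1]) ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))

# Example: head-to-fingertip ray on a 480x640 image; the wrist-to-fingertip
# heatmap for the second model would be built the same way from the wrist.
heatmap = gaussian_ray_heatmap(origin=(320, 100), fingertip=(360, 180), h=480, w=640)
```

A heatmap like this can be stacked with the RGB channels as an extra input plane, giving the network an explicit spatial prior over where the pointing line passes.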