Embodied reference understanding is crucial for intelligent agents to predict referents based on human intention conveyed through gesture signals and language descriptions. This paper introduces Attention-Dynamic DINO, a novel framework designed to mitigate misinterpretations of pointing gestures across various interaction contexts. Our approach integrates visual and textual features to simultaneously predict the target object's bounding box and the attention source of the pointing gesture. Leveraging the distance-aware nature of nonverbal communication in visual perspective taking, we extend the virtual touch line mechanism and propose an attention-dynamic touch line that represents referring gestures as a function of interaction distance. The combination of this distance-aware approach with independent prediction of the attention source enhances the alignment between objects and the gesture-defined line. Extensive experiments on the YouRefIt dataset demonstrate that our gesture-understanding method significantly improves task performance. Our model achieves 76.4% accuracy at the 0.25 IoU threshold and, notably, surpasses human performance at the 0.75 IoU threshold, a first in this domain. Comparative experiments against distance-unaware methods from previous research further validate the superiority of the attention-dynamic touch line across diverse contexts.
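The core geometric idea behind a touch-line mechanism can be illustrated with a minimal sketch. The snippet below scores candidate objects by the cosine alignment between the pointing ray (attention source through fingertip) and the direction from the attention source to each object; the function name `touch_line_alignment` and this particular scoring form are illustrative assumptions, not the paper's exact formulation or loss.

```python
import numpy as np

def touch_line_alignment(attention_src, fingertip, obj_center):
    """Cosine alignment between the pointing ray (attention source -> fingertip)
    and the direction from the attention source to an object center.

    Hypothetical illustration of touch-line scoring; the paper's actual model
    predicts the attention source dynamically based on interaction distance.
    """
    ray = fingertip - attention_src
    to_obj = obj_center - attention_src
    ray = ray / np.linalg.norm(ray)
    to_obj = to_obj / np.linalg.norm(to_obj)
    # 1.0 means the object lies exactly on the extended touch line.
    return float(np.dot(ray, to_obj))
```

In a distance-aware variant, the attention source itself would shift with interaction distance (e.g., from eye toward shoulder or wrist) before this alignment is computed, which is the behavior the attention-dynamic touch line captures.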