Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over \textbf{15k} interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an \textbf{11.7\%} absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.