Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/