Despite advances in multimodal AI, current vision-based assistants often remain inefficient in collaborative tasks. We identify two key gulfs: a communication gulf, where users must translate rich parallel intentions into verbal commands due to the channel mismatch, and an understanding gulf, where AI struggles to interpret subtle embodied cues. To address these, we propose Eye2Eye, a framework that leverages the first-person perspective as a channel for human-AI cognitive alignment. It integrates three components: (1) joint attention coordination for fluid focus alignment, (2) revisable memory to maintain evolving common ground, and (3) reflective feedback that allows users to clarify and refine the AI's understanding. We implement this framework in an AR prototype and evaluate it through a user study and a post-hoc pipeline evaluation. Results show that Eye2Eye significantly reduces task completion time and interaction load while increasing trust, demonstrating that its components work in concert to improve collaboration.