Current LLM assistants are adept at answering questions, but they have limited access to the behavioral context that reveals when and where a user is struggling. We present a gaze-grounded multimodal LLM assistant that uses egocentric video with gaze overlays to identify likely points of difficulty and to target retrospective follow-up assistance at them. We evaluate this approach in a controlled study (n=36) comparing the gaze-aware assistant to a text-only LLM assistant. Compared to the conventional LLM assistant, the gaze-aware assistant was rated as significantly more accurate and personalized in its assessments of users' reading behavior, and it significantly improved users' ability to recall information. Users also spoke significantly fewer words with the gaze-aware assistant, suggesting more efficient interactions. Qualitative results underscored both perceived benefits for comprehension and challenges when the assistant's interpretations of gaze behavior were inaccurate. Our findings suggest that gaze-aware LLM assistants can reason about users' cognitive needs to improve their cognitive outcomes.