Eye-hand coordinated interaction is becoming a mainstream interaction modality in Virtual Reality (VR) user interfaces. Current paradigms for this multimodal interaction require users to learn predefined gestures and memorize multiple gesture-task associations, which can be summarized as an ``Operation-to-Intent'' paradigm. This paradigm increases users' learning costs and offers low tolerance for interaction errors. In this paper, we propose SIAgent, a novel ``Intent-to-Operation'' framework that allows users to express interaction intents through natural eye-hand motions grounded in common sense and habit. Our system features two main components: (1) intent recognition, which translates spatial interaction data into natural language and infers user intent, and (2) agent-based execution, which generates an agent to carry out the corresponding task. This design eliminates the need for gesture memorization and accommodates individual motion preferences with high error tolerance. We conduct two user studies spanning more than 60 interaction tasks, comparing our method with two ``Operation-to-Intent'' techniques. Results show that our method achieves higher intent recognition accuracy than gaze + pinch interaction (97.2% vs. 93.1%) while reducing arm fatigue and improving usability and user preference. A second study verifies the roles of the eye-gaze and hand-motion channels in intent recognition. Our work offers insights into enhancing VR interaction intelligence through intent-driven design. Our source code and LLM prompts will be made available upon publication.
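To make the two-stage pipeline concrete, the minimal sketch below illustrates one plausible shape of the intent-recognition component described above: raw eye-hand telemetry is verbalized into a natural-language observation, which would then be sent to an LLM for intent inference. All names here (SpatialFrame, verbalize, infer_intent) are hypothetical illustrations, not the paper's actual API, and the LLM call is stubbed out.

```python
# Hypothetical sketch of the "Intent-to-Operation" pipeline: spatial
# eye-hand data -> natural-language observation -> LLM intent inference.
# Names and fields are assumptions for illustration, not SIAgent's code.

from dataclasses import dataclass


@dataclass
class SpatialFrame:
    """One sample of eye-hand telemetry from the VR runtime (assumed schema)."""
    gaze_target: str        # id of the object the gaze ray currently hits
    gaze_dwell_ms: float    # how long gaze has rested on that object
    hand_motion: str        # coarse motion label, e.g. "reach", "twist"
    hand_speed_mps: float   # hand speed in meters per second


def verbalize(frame: SpatialFrame) -> str:
    """Translate raw spatial interaction data into a natural-language observation."""
    return (
        f"The user has looked at '{frame.gaze_target}' for "
        f"{frame.gaze_dwell_ms:.0f} ms while performing a "
        f"'{frame.hand_motion}' hand motion at {frame.hand_speed_mps:.2f} m/s."
    )


def infer_intent(observation: str) -> str:
    """Stub for the LLM call that maps an observation to an interaction intent.

    A real system would send `prompt` (plus scene context) to an LLM and
    parse a structured intent from the reply; here we return a fixed value.
    """
    prompt = (
        "Given the user's eye-hand behavior below, name the most likely "
        "interaction intent (e.g. 'grab', 'open', 'scale').\n" + observation
    )
    # llm_complete(prompt) would go here in a real implementation.
    del prompt
    return "grab"


if __name__ == "__main__":
    frame = SpatialFrame("drawer_handle", 420.0, "reach", 0.31)
    obs = verbalize(frame)
    print(obs)
    print("Inferred intent:", infer_intent(obs))
```

In the full framework, the inferred intent would then be handed to the agent-based execution component, which performs the corresponding task on the user's behalf.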