Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gesture-only or language-only commands, which makes interaction inefficient and ambiguous, particularly for users with physical impairments. In this paper, we introduce FAM-HRI, an efficient multimodal framework for HRI that integrates language and gaze inputs via foundation models. Using lightweight Meta ARIA glasses, our system captures real-time multimodal signals and uses large language models (LLMs) to fuse user intention with scene context, enabling intuitive and precise robot manipulation. Our method accurately determines the gaze fixation time interval, reducing the noise caused by the dynamic nature of gaze. Experimental evaluations demonstrate that FAM-HRI achieves a high success rate in task execution while maintaining a low interaction time, providing a practical solution for individuals with limited physical mobility or motor impairments. To support the community, we have released our system design, algorithms, and solutions at https://github.com/laiyuzhi/FAM-HRI.