Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI) - determining who issued a command - is especially challenging due to multi-user participation and distance-induced sensor ambiguity. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning optical flow from a robot-mounted camera with signals from a hand-worn IMU. We first elicit a user-defined gesture set (N=12) and collect a multimodal command-gesture dataset (N=38) in long-range, multi-user HRI scenarios. HiSync then extracts frequency-domain hand-motion features from both camera and IMU data, and a learned CSINet denoises the IMU readings, temporally aligns the modalities, and performs distance-aware multi-window fusion to compute the cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes at distances up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior state of the art by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI.
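The core matching idea - binding a camera-observed motion trace to the IMU of the person who produced it via frequency-domain similarity - can be sketched as follows. This is a minimal illustration, not the paper's CSINet: the 0.5-5 Hz gesture band, the 50 Hz sampling rate, and the cosine-similarity scoring are assumptions made for this sketch.

```python
# Hedged sketch: match a robot-camera hand-motion trace against candidate
# wrist-IMU traces by comparing band-limited magnitude spectra.
# NOT the paper's method; band limits and sampling rate are assumed.
import numpy as np

FS = 50.0  # sampling rate in Hz (assumed for this sketch)

def band_spectrum(signal, fs=FS, lo=0.5, hi=5.0):
    """Magnitude spectrum restricted to a plausible hand-gesture band."""
    sig = np.asarray(signal, dtype=float)
    sig = sig - sig.mean()  # remove DC offset before the FFT
    mag = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(sig.size, d=1.0 / fs)
    keep = (freqs >= lo) & (freqs <= hi)
    return mag[keep]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def identify_source(flow_trace, imu_traces):
    """Return the index of the IMU whose spectrum best matches the flow."""
    ref = band_spectrum(flow_trace)
    scores = [cosine(ref, band_spectrum(t)) for t in imu_traces]
    return int(np.argmax(scores)), scores

# Synthetic example: the commander waves at ~2 Hz, a bystander at ~4 Hz.
t = np.arange(0.0, 4.0, 1.0 / FS)
rng = np.random.default_rng(0)
flow = np.sin(2 * np.pi * 2.0 * t) + 0.1 * rng.standard_normal(t.size)
imus = [np.sin(2 * np.pi * 4.0 * t),   # bystander's IMU
        np.sin(2 * np.pi * 2.0 * t)]   # commander's IMU
idx, scores = identify_source(flow, imus)
print(idx)  # index of the best-matching IMU wearer
```

In the full system this per-window score would additionally pass through the learned denoising, temporal alignment, and distance-aware multi-window fusion stages described above.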