Long-range Human-Robot Interaction (HRI) remains underexplored. Within it, Command Source Identification (CSI), the task of determining who issued a command, is especially challenging because multiple users and long distances make sensing ambiguous. We introduce HiSync, an optical-inertial fusion framework that treats hand motion as a binding cue by aligning optical flow from a robot-mounted camera with signals from a hand-worn IMU. We first elicit a user-defined gesture set (N=12) and collect a multimodal command-gesture dataset (N=38) in long-range, multi-user HRI scenarios. Next, HiSync extracts frequency-domain hand-motion features from both camera and IMU data. A learned network, CSINet, then denoises the IMU readings, temporally aligns the two modalities, and performs distance-aware multi-window fusion to compute the cross-modal similarity of subtle, natural gestures, enabling robust CSI. In three-person scenes at distances of up to 34 m, HiSync achieves 92.32% CSI accuracy, outperforming the prior SOTA by 48.44%. HiSync is also validated in a real-robot deployment. By making CSI reliable and natural, HiSync provides a practical primitive and design guidance for public-space HRI. https://github.com/OctopusWen/HiSync
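To make the binding idea concrete, the sketch below compares the frequency content of one IMU signal against the optical-flow signal of each visible person and picks the best match. It is an illustrative, hand-written baseline and not HiSync or CSINet: there is no learned denoising, temporal alignment, or distance-aware multi-window fusion. The function names (`band_spectrum`, `identify_command_source`), the 0.5 to 5 Hz band, and the assumption that all signals are already resampled to a common rate and length are hypothetical choices made for this example.

```python
import numpy as np

def band_spectrum(signal, fs, f_lo=0.5, f_hi=5.0):
    """Unit-norm magnitude spectrum of a 1-D motion signal within [f_lo, f_hi] Hz.

    The band limits are assumed values for hand gestures, not taken from the paper.
    """
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                           # remove the constant offset
    mag = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = mag[(freqs >= f_lo) & (freqs <= f_hi)]
    norm = np.linalg.norm(band)
    return band / norm if norm > 0 else band

def identify_command_source(imu_signal, flow_signals, fs):
    """Return (index of best-matching person, per-person cosine similarities).

    imu_signal   : 1-D motion magnitude from the hand-worn IMU
    flow_signals : list of 1-D optical-flow magnitudes, one per candidate person
    All signals are assumed to share the sampling rate fs and the same length.
    """
    imu_spec = band_spectrum(imu_signal, fs)
    scores = [float(np.dot(imu_spec, band_spectrum(f, fs))) for f in flow_signals]
    return int(np.argmax(scores)), scores

# Synthetic three-person scene: each person moves a hand at a different rate.
fs = 30.0                                      # assumed common sampling rate (Hz)
t = np.arange(0, 4.0, 1.0 / fs)
flows = [np.sin(2 * np.pi * f * t) for f in (1.0, 2.5, 4.0)]
rng = np.random.default_rng(0)
imu = np.cos(2 * np.pi * 2.5 * t) + 0.3 * rng.standard_normal(t.size)  # worn by person 1
best, scores = identify_command_source(imu, flows, fs)
```

Comparing magnitude spectra sidesteps small timing offsets between camera and IMU, since a pure delay changes only the phase. It also ignores the fact that an IMU measures acceleration while optical flow tracks image-plane velocity, which is one reason a learned model would be expected to do better than this baseline on subtle gestures at long range.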