This work presents a next-generation human-robot interface that can infer and realize the user's manipulation intention from gaze alone. Specifically, we develop a system that integrates near-eye tracking with robotic manipulation to enable user-specified actions (e.g., grasp, pick-and-place), where visual information is merged with human attention to create a mapping to the desired robot actions. To enable sight-guided manipulation, a head-mounted near-eye-tracking device is developed to track eyeball movements in real time, so that the user's visual attention can be identified. To improve grasping performance, a transformer-based grasp model is then developed. Stacked transformer blocks extract hierarchical features: at each stage the number of channels is expanded while the resolution of the feature maps is reduced. Experimental validation demonstrates that the eye-tracking system yields low gaze estimation error and that the grasping system yields promising results on multiple grasping datasets. This work is a proof of concept for a gaze-interaction-based assistive robot, which holds great promise for helping older adults and people with upper-limb disabilities in their daily lives. A demo video is available at https://www.youtube.com/watch?v=yuZ1hukYUrM
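The hierarchical feature extraction mentioned above can be illustrated with a minimal sketch. The abstract does not say how the resolution is reduced between stages, so this example assumes a patch-merging step of the kind used in hierarchical vision transformers: each 2x2 neighbourhood is concatenated (4C channels) and linearly projected to 2C channels. The names `patch_merge` and `stage_shapes` are hypothetical, the weights are random, and the transformer blocks themselves are omitted; only the change in feature-map shape across stages is shown.

```python
import random


def patch_merge(feature_map, weight):
    """Halve the spatial resolution and double the channel count.

    feature_map: H x W grid (nested lists) of length-C channel vectors.
    weight: projection matrix with 4C rows and 2C columns (assumed, not
    taken from the paper).
    """
    height, width = len(feature_map), len(feature_map[0])
    assert height % 2 == 0 and width % 2 == 0, "H and W must be even"
    columns = list(zip(*weight))  # 2C columns, each of length 4C
    merged_map = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            # Concatenate the channel vectors of the 2x2 neighbourhood (4C).
            merged = (feature_map[i][j] + feature_map[i][j + 1]
                      + feature_map[i + 1][j] + feature_map[i + 1][j + 1])
            # Linear projection from 4C to 2C channels.
            row.append([sum(m * w for m, w in zip(merged, col))
                        for col in columns])
        merged_map.append(row)
    return merged_map


def stage_shapes(height, width, channels, num_stages, seed=0):
    """Return the (H, W, C) shape of the feature map after each stage."""
    rng = random.Random(seed)
    x = [[[rng.gauss(0, 1) for _ in range(channels)]
          for _ in range(width)] for _ in range(height)]
    shapes = [(height, width, channels)]
    for _ in range(num_stages):
        c = len(x[0][0])
        # Random placeholder weights; a trained model would learn these.
        weight = [[rng.gauss(0, 1) for _ in range(2 * c)]
                  for _ in range(4 * c)]
        x = patch_merge(x, weight)
        shapes.append((len(x), len(x[0]), len(x[0][0])))
    return shapes


if __name__ == "__main__":
    for shape in stage_shapes(8, 8, 4, num_stages=2):
        print(shape)
```

Under these assumptions, each stage trades spatial detail for channel capacity, which is the pattern the abstract describes; the actual downsampling operator and channel widths of the grasp model may differ.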