Uncertainty-Aware Intention Prediction for Human-to-Robot Assembly Teleoperation

In assisted teleoperation for human-robot collaboration, accurate intention prediction is critical for timely and reliable robotic assistance during long-horizon manipulation and assembly tasks. These systems must continuously interpret user behavior to recognize actions, anticipate intentions, and detect mistakes in real time. However, robot teleoperation demonstrations are costly to collect and limited by hardware availability, whereas human demonstrations are easier to collect and provide rich temporal structure. To address this data gap, we propose an uncertainty-aware human-to-robot intention prediction framework that combines: (1) hierarchical transfer learning, where MS-TCN++ is pretrained on human hand demonstrations and fine-tuned on limited robot teleoperation data to capture low-level actions and high-level task intentions; (2) a conformal prediction module that provides frame-level prediction sets with statistical coverage guarantees for reliable uncertainty quantification and early intention estimation; and (3) vision-language model (VLM)-guided segment correction, which selectively reviews low-confidence or temporally uncertain segments using visual and temporal context. The framework supports action recognition, temporal segmentation, intention anticipation, and mistake detection for assisted teleoperation. Experiments on robot assembly demonstrations with 22 action classes show that human-to-robot fine-tuning improves the robot test-set Edit score from 70.50 to 80.70 using only 16 robot demonstrations. Edit-safe VLM correction further improves frame accuracy from 45.21% to 46.42% and increases F1@25 and F1@50 while preserving the Edit score. These results indicate that human demonstrations provide scalable pretraining data for robust, uncertainty-aware robot action segmentation. Code and data: project website.
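The frame-level prediction sets in item (2) can be illustrated with split conformal prediction. The sketch below is a generic illustration under stated assumptions, not the implementation used in this work: it assumes per-frame class probabilities from the segmentation model, a held-out set of labeled calibration frames, and the nonconformity score 1 - p(true class). The function names and the miscoverage level `alpha` are hypothetical, and the coverage guarantee holds marginally only if calibration and test frames are exchangeable.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold from held-out frames (hypothetical helper).

    cal_probs:  (n_frames, n_classes) predicted class probabilities.
    cal_labels: (n_frames,) integer ground-truth class per frame.
    alpha:      target miscoverage level (assumed value, e.g. 0.1 for 90% coverage).
    """
    cal_probs = np.asarray(cal_probs, dtype=float)
    cal_labels = np.asarray(cal_labels, dtype=int)
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        # Too few calibration frames for this alpha: fall back to including every class.
        return 1.0
    return float(np.sort(scores)[k - 1])

def prediction_sets(probs, qhat):
    """Boolean mask (n_frames, n_classes); True where a class is in the frame's set."""
    return (1.0 - np.asarray(probs, dtype=float)) <= qhat
```

A frame whose set contains a single class can be treated as a confident prediction, while a large set flags a frame that may need review. How such sets feed into early intention estimation or segment correction here is not specified in the abstract.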

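The Edit score reported above is commonly computed in the action segmentation literature as a normalized Levenshtein distance between the predicted and ground-truth sequences of segment labels. The sketch below follows that common definition and is an assumption about the metric, not the authors' evaluation code; the function names are hypothetical, and any handling of background classes is omitted.

```python
def segment_labels(frame_labels):
    """Collapse per-frame labels into the ordered list of segment labels."""
    segments = []
    for label in frame_labels:
        if not segments or segments[-1] != label:
            segments.append(label)
    return segments

def edit_score(pred_frames, true_frames):
    """Segmental Edit score in [0, 100]; higher means better segment ordering."""
    pred = segment_labels(pred_frames)
    true = segment_labels(true_frames)
    if not pred and not true:
        return 100.0
    # Levenshtein distance over segment labels, computed row by row.
    prev = list(range(len(true) + 1))
    for i, p in enumerate(pred, start=1):
        cur = [i]
        for j, t in enumerate(true, start=1):
            cur.append(min(prev[j] + 1,              # drop a predicted segment
                           cur[j - 1] + 1,           # add a missing segment
                           prev[j - 1] + (p != t)))  # substitute a label
        prev = cur
    return 100.0 * (1.0 - prev[-1] / max(len(pred), len(true)))
```

Under this definition the score depends only on the order of segments, not on where their boundaries fall. A correction that relabels frames without creating, removing, or reordering segments therefore leaves the Edit score unchanged while frame accuracy can still improve.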