RoboSubtaskNet: Temporal Sub-task Segmentation for Human-to-Robot Skill Transfer in Real-World Environments

Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial for safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN that employs a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and on our RoboSubtask benchmark on boundary-sensitive and sequence metrics, while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success of approximately 91.25%). These results demonstrate a practical path from sub-task-level video understanding to deployed robotic manipulation in real-world settings.
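To illustrate why a Fibonacci dilation schedule may favor short-horizon transitions, the following minimal sketch compares the per-stage receptive field of a dilated temporal convolution stack under two schedules. It is not the authors' implementation. The helper names, the starting pair (1, 2), the layer count of 10, and the kernel size of 3 are all assumptions chosen for illustration; the abstract does not specify them. The exponential schedule (1, 2, 4, ...) is the one used by the original MS-TCN.

```python
# Illustrative sketch only: helper names, the starting pair (1, 2),
# num_layers=10 and kernel_size=3 are assumptions, not values from the paper.

def fibonacci_dilations(num_layers):
    """Return the first `num_layers` dilations of the sequence 1, 2, 3, 5, 8, ..."""
    dilations = []
    a, b = 1, 2
    for _ in range(num_layers):
        dilations.append(a)
        a, b = b, a + b
    return dilations


def exponential_dilations(num_layers):
    """Return the doubling schedule 1, 2, 4, 8, ... used by the original MS-TCN."""
    return [2 ** i for i in range(num_layers)]


def receptive_field(dilations, kernel_size=3):
    """Receptive field (in frames) of stacked dilated 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d frames.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    for name, schedule in [
        ("fibonacci", fibonacci_dilations(10)),
        ("exponential", exponential_dilations(10)),
    ]:
        print(name, schedule, "receptive field:", receptive_field(schedule))
```

Under these assumptions the Fibonacci stack grows more slowly, so it samples temporal context more densely at short ranges and ends with a smaller per-stage receptive field than the doubling schedule. That is consistent with the stated aim of capturing short transitions, and plausibly with the weaker relative results on the long-horizon Breakfast benchmark, although the abstract does not make that connection itself.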
