Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial to safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN whose Fibonacci dilation schedule better captures short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and two temporal regularizers (a truncated MSE term and a transition-aware term) to reduce over-segmentation and to encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and on our RoboSubtask benchmark in both boundary-sensitive and sequence metrics, while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success of approximately 91.25%). These results demonstrate a practical path from sub-task-level video understanding to deployed robotic manipulation in real-world settings.
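To make the two temporal design choices named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a Fibonacci dilation schedule with the power-of-two schedule of the original MS-TCN, and evaluates a truncated MSE smoothing term in the form published for MS-TCN. The abstract does not give the starting values of the Fibonacci sequence, the kernel size, the number of layers, or the truncation threshold, so `fibonacci_dilations` starting at 1, 2, `kernel_size=3`, ten layers, and `tau=4.0` are all assumptions, and every function name here is hypothetical.

```python
def fibonacci_dilations(num_layers):
    """Dilation per layer following 1, 2, 3, 5, 8, ... (assumed starting values)."""
    dilations = []
    a, b = 1, 2
    for _ in range(num_layers):
        dilations.append(a)
        a, b = b, a + b
    return dilations


def receptive_field(dilations, kernel_size=3):
    """Receptive field, in frames, of a stack of stride-1 dilated 1D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


def truncated_mse(log_probs, tau=4.0):
    """Truncated MSE over frame-to-frame changes in per-class log-probabilities.

    log_probs is a list of T frames, each a list of C log-probabilities.
    Each absolute change is clipped at tau before squaring, so genuine
    boundaries are not penalized without limit. The result is averaged
    over the T - 1 transitions and C classes (an assumed normalization).
    """
    num_frames = len(log_probs)
    if num_frames < 2:
        return 0.0
    num_classes = len(log_probs[0])
    total = 0.0
    for t in range(1, num_frames):
        for c in range(num_classes):
            delta = abs(log_probs[t][c] - log_probs[t - 1][c])
            total += min(delta, tau) ** 2
    return total / ((num_frames - 1) * num_classes)


if __name__ == "__main__":
    fib = fibonacci_dilations(10)
    doubling = [2 ** i for i in range(10)]  # schedule of the original MS-TCN
    print(fib)                        # [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
    print(receptive_field(fib))       # 463
    print(receptive_field(doubling))  # 2047
```

Under these assumptions, the Fibonacci schedule grows more slowly than doubling, so more layers operate at small dilations. That is one plausible reading of why it would favor short-horizon transitions such as reach-pick-place, at the cost of a smaller overall receptive field.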