Imitation learning enables robots to acquire complex manipulation skills from human demonstrations, but current methods rely solely on low-level sensorimotor data while ignoring the rich semantic knowledge humans naturally possess about tasks. We present ConceptACT, an extension of Action Chunking with Transformers that leverages episode-level semantic concept annotations during training to improve learning efficiency. Unlike language-conditioned approaches that require semantic input at deployment, ConceptACT uses human-provided concepts (object properties, spatial relationships, task constraints) exclusively during demonstration collection, adding minimal annotation burden. We integrate concepts using a modified transformer architecture in which the final encoder layer implements concept-aware cross-attention, supervised to align with human annotations. Through experiments on two robotic manipulation tasks with logical constraints, we demonstrate that ConceptACT converges faster and achieves superior sample efficiency compared to standard ACT. Crucially, we show that architectural integration through attention mechanisms significantly outperforms naive auxiliary prediction losses or language-conditioned models. These results demonstrate that properly integrated semantic supervision provides powerful inductive biases for more efficient robot learning.
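The concept-aware cross-attention mechanism described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the pooled-attention alignment loss, and the binary concept-annotation format are illustrative assumptions about how encoder states might attend over concept embeddings and be supervised against episode-level labels.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concept_cross_attention(hidden, concept_emb):
    """Cross-attend T encoder states (T, d) over K concept embeddings (K, d).

    Returns concept-conditioned features (T, d) and attention weights (T, K).
    """
    scores = hidden @ concept_emb.T / np.sqrt(hidden.shape[-1])  # (T, K)
    attn = softmax(scores, axis=-1)                              # rows sum to 1
    out = attn @ concept_emb                                     # (T, d)
    return out, attn

def concept_alignment_loss(attn, labels, eps=1e-8):
    """Illustrative alignment loss (an assumption, not the paper's exact form):
    cross-entropy between the time-pooled attention mass per concept and the
    binary episode-level concept annotations."""
    p = attn.mean(axis=0)           # (K,) average attention per concept
    p = p / p.sum()                 # renormalise to a distribution
    return -np.sum(labels * np.log(p + eps)) / max(labels.sum(), 1.0)
```

Supervising the attention weights directly, rather than adding a separate prediction head, is one way to realise the paper's finding that architectural integration outperforms naive auxiliary losses: the gradient of the alignment loss shapes the same attention map the policy uses.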