Recent advances in imitation learning, particularly through the integration of LLM techniques, are poised to significantly improve robots' dexterity and adaptability. This paper proposes using Mamba, a state-of-the-art architecture with potential applications in LLMs, for robotic imitation learning, highlighting its ability to function as an encoder that effectively captures contextual information. By reducing the dimensionality of the state space, Mamba operates similarly to an autoencoder: it compresses sequential information into state variables while preserving the temporal dynamics essential for accurate motion prediction. Experimental results on tasks such as cup placing and case loading show that, despite exhibiting higher estimation errors, Mamba achieves higher success rates than Transformers in practical task execution. This performance is attributed to Mamba's structure, which incorporates a state space model. Additionally, the study investigates Mamba's capacity to serve as a real-time motion generator with a limited amount of training data.
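The compression described above can be illustrated with the linear state-space recurrence that underlies Mamba-style models: the hidden state h_t summarizes the input history, and predictions are read out from it. The sketch below is a minimal, illustrative implementation; the dimensions, matrices, and function name are hypothetical and not taken from the paper.

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Run the discrete state-space recurrence
        h_t = A h_{t-1} + B x_t,   y_t = C h_t
    over a sequence xs of shape (T, d_in); returns outputs of shape (T, d_out)."""
    d_state = A.shape[0]
    h = np.zeros(d_state)          # state vector: compressed summary of the past
    ys = []
    for x in xs:
        h = A @ h + B @ x          # fold the new observation into the state
        ys.append(C @ h)           # read out a prediction from the state
    return np.stack(ys)

# Illustrative sizes and parameters (assumed, not from the paper)
rng = np.random.default_rng(0)
d_in, d_state, d_out, T = 4, 8, 2, 16
A = 0.9 * np.eye(d_state)                 # stable dynamics retain temporal context
B = 0.1 * rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
xs = rng.normal(size=(T, d_in))

ys = ssm_scan(A, B, C, xs)
print(ys.shape)  # (16, 2): one low-dimensional output per timestep
```

Note that the state h has a fixed size regardless of sequence length, which is the autoencoder-like bottleneck the abstract refers to; Mamba additionally makes A, B, and C input-dependent (selective), which this linear sketch omits.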