This paper applies Transformer-XL to long-sequence tasks in robotic learning from demonstration (LfD). The proposed framework fuses multi-modal sensor inputs, including RGB-D images, LiDAR, and tactile sensors, into a single comprehensive feature vector. By leveraging Transformer-XL's attention mechanism and positional encoding, the approach handles the complexity and long-term dependencies inherent in multi-modal sensory data. An extensive empirical evaluation shows significant improvements in task success rate, accuracy, and computational efficiency over conventional methods such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs). These findings indicate that the Transformer-XL-based framework enhances the robot's perception and decision-making and provides a robust foundation for future work in LfD.