This paper presents CONTHER, a novel reinforcement learning algorithm designed to train robotic agents efficiently and rapidly for goal-oriented manipulation tasks and obstacle avoidance. The algorithm uses a modified replay buffer inspired by Hindsight Experience Replay (HER) to artificially populate experience with successful trajectories, effectively addressing sparse-reward scenarios and eliminating the need to collect expert demonstrations manually. The algorithm employs a Transformer-based architecture to incorporate the context of previous states, allowing the agent to perform a deeper analysis and make decisions in a manner more akin to human learning. The effectiveness of the built-in replay buffer, which acts as an "internal demonstrator", is twofold: it accelerates learning and allows the algorithm to adapt to different tasks. Empirical results confirm that the algorithm outperforms the other methods considered by an average of 38.46%, and the strongest baseline by 28.21%, showing higher success rates and faster convergence in the point-reaching task. Because control is performed in the robot's joint space, the algorithm lends itself to adaptation to a real robot system and to the construction of obstacle avoidance tasks; accordingly, it has also been tested on tasks requiring following a complex dynamic trajectory while avoiding obstacles. The design of the algorithm ensures its applicability to a wide range of goal-oriented tasks, making it an easily integrated solution for real-world robotics applications.
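The "internal demonstrator" idea above rests on HER-style goal relabeling: failed episodes are replayed with the goals the agent actually achieved, so the buffer fills with synthetic successes even under a sparse reward. The sketch below illustrates this mechanism in minimal form; the class and parameter names (`HindsightReplayBuffer`, `k_relabel`) are illustrative assumptions, not the paper's actual implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    goal: tuple
    done: bool

class HindsightReplayBuffer:
    """Replay buffer that relabels episodes with goals achieved later in the
    same episode ('future' strategy), turning failures into usable successes."""

    def __init__(self, capacity=10_000, k_relabel=4):
        self.capacity = capacity
        self.k_relabel = k_relabel  # relabeled copies per transition
        self.buffer = []

    def _reward(self, achieved, goal):
        # Sparse reward: 0 when the goal is reached, -1 otherwise
        return 0.0 if achieved == goal else -1.0

    def store_episode(self, episode):
        # Keep the original transitions with their intended goal
        for t in episode:
            self._push(t)
        # Relabel: swap the intended goal for states actually reached
        # later in the episode, recomputing the sparse reward
        for i, t in enumerate(episode):
            future = episode[i:]
            for _ in range(self.k_relabel):
                new_goal = random.choice(future).next_state
                self._push(Transition(
                    state=t.state, action=t.action,
                    reward=self._reward(t.next_state, new_goal),
                    next_state=t.next_state, goal=new_goal,
                    done=t.next_state == new_goal))

    def _push(self, t):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # drop the oldest transition
        self.buffer.append(t)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```

An episode that never reaches its intended goal still yields zero-reward (successful) transitions after relabeling, which is what makes learning from sparse rewards tractable without external demonstrations.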