This work presents a hierarchical framework for bipedal locomotion that combines a Reinforcement Learning (RL)-based high-level (HL) planner policy, which generates task space commands online, with a model-based low-level (LL) controller that tracks the desired task space trajectories. Unlike traditional end-to-end learning approaches, our HL policy draws on insights from the angular momentum-based linear inverted pendulum (ALIP) model to design the observation and action spaces of the Markov Decision Process (MDP). This simple yet effective design creates an insightful mapping between a low-dimensional state that captures the complex dynamics of bipedal locomotion and a set of task space outputs that shape the robot's walking gait. The HL policy is agnostic to the choice of task space LL controller, which increases design flexibility and allows the framework to generalize to other bipedal robots. Compared with the ALIP model-based approach and state-of-the-art learning-based frameworks for bipedal locomotion, this hierarchical design improves performance, data efficiency, and robustness. The proposed hierarchical controller is tested on three robots: Rabbit, a five-link underactuated planar biped; Walker2D, a seven-link fully actuated planar biped; and Digit, a 3D humanoid robot with 20 actuated joints. The trained policy naturally learns human-like locomotion behaviors and effectively tracks a wide range of walking speeds while preserving the robustness and stability of the walking gait, even under adversarial conditions.
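As background for the ALIP-inspired state, the sketch below shows the standard planar ALIP model and its closed-form solution. It is not the authors' implementation, and the abstract does not specify the exact observation or action spaces. The sketch assumes a constant center-of-mass (CoM) height `H`, zero ankle torque, and a point contact. Under these assumptions the dynamics are $\dot{x} = L/(mH)$ and $\dot{L} = mgx$, where `x` is the horizontal CoM position relative to the stance contact point and `L` is the angular momentum about that point. The names `AlipState` and `alip_predict`, and the numerical values in the usage note, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration [m/s^2]


@dataclass
class AlipState:
    """Hypothetical container for the 2-D planar ALIP state."""
    x: float  # horizontal CoM position relative to the stance contact point [m]
    L: float  # angular momentum about the stance contact point [kg m^2/s]


def alip_predict(s: AlipState, t: float, mass: float, height: float) -> AlipState:
    """Propagate the ALIP state forward by time t using the closed-form solution.

    Assumes constant CoM height and zero ankle torque, so that
    x_dot = L / (m H) and L_dot = m g x.
    """
    ell = math.sqrt(G / height)  # natural frequency of the pendulum [1/s]
    c, sh = math.cosh(ell * t), math.sinh(ell * t)
    x = c * s.x + sh / (mass * height * ell) * s.L
    L = mass * height * ell * sh * s.x + c * s.L
    return AlipState(x, L)
```

For example, `alip_predict(AlipState(x=-0.05, L=3.0), 0.35, 30.0, 0.9)` predicts the state 0.35 s later for an illustrative 30 kg robot with a 0.9 m CoM height. Because this prediction depends only on `(x, L)` and time, it shows why such a low-dimensional state can summarize the gait-relevant dynamics that an ALIP-inspired policy might observe.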