We explore sim-to-real transfer of deep reinforcement learning controllers for a heavy vehicle with active suspensions designed for traversing rough terrain. While related research primarily focuses on lightweight robots with electric motors and fast actuation, this study uses a forestry vehicle with a complex hydraulic driveline and slow actuation. We simulate the vehicle using multibody dynamics and apply system identification to find an appropriate set of simulation parameters. We then train policies in simulation using various techniques to mitigate the sim-to-real gap, including domain randomization, action delays, and a reward penalty to encourage smooth control. In reality, the policies trained with action delays and a penalty for erratic actions perform at nearly the same level as in simulation. In experiments on level ground, the simulated and real motion trajectories closely overlap when turning to either side, as well as in a route tracking scenario. When the vehicle faces a ramp that requires active use of the suspensions, the simulated and real motions are also in close alignment. This shows that the actuator model, combined with system identification, is sufficiently accurate. We observe that policies trained without the additional action penalty exhibit fast switching or bang-bang control. These policies produce smooth motions and high performance in simulation but transfer poorly to reality. We find that policies make only marginal use of the local height map for perception and show no indications of predictive planning. However, the strong transfer capabilities imply that further development concerning perception and performance can be largely confined to simulation.
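The abstract names action delays and a penalty on erratic actions as the measures that made transfer work, but does not give their exact form. The following is a minimal sketch of how such mechanisms are commonly built, not the authors' implementation. The names `DelayedActionBuffer`, `action_rate_penalty`, and `sample_delay_steps`, the squared-difference form of the penalty, and the uniform sampling of the delay are all assumptions made for illustration.

```python
from collections import deque
import random


class DelayedActionBuffer:
    """Hypothetical helper: holds each action for `delay_steps` control
    steps before it reaches the simulated actuators."""

    def __init__(self, delay_steps, neutral_action):
        # Pre-fill with a neutral action so the first outputs are defined.
        self.queue = deque(list(neutral_action) for _ in range(delay_steps))

    def push(self, action):
        """Store the newest action and return the one that is now due."""
        self.queue.append(list(action))
        return self.queue.popleft()


def action_rate_penalty(action, previous_action, weight):
    """Assumed penalty form: negative weighted squared change between
    consecutive actions, so bang-bang switching is penalized heavily."""
    change = sum((a - b) ** 2 for a, b in zip(action, previous_action))
    return -weight * change


def sample_delay_steps(rng, low, high):
    """Assumed randomization: draw the delay (in control steps)
    uniformly from [low, high] at the start of each episode."""
    return rng.randint(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    delay = sample_delay_steps(rng, 1, 3)
    buffer = DelayedActionBuffer(delay, neutral_action=[0.0])
    previous = [0.0]
    for command in ([1.0], [-1.0], [1.0]):
        applied = buffer.push(command)
        penalty = action_rate_penalty(command, previous, weight=0.5)
        print(f"commanded={command} applied={applied} penalty={penalty}")
        previous = command
```

In a training loop, the policy's output would pass through `push` before being sent to the simulator, and the penalty would be added to the task reward at every step. Under this assumed form, a full reversal of a normalized command costs four times as much as a move from neutral to full, which is what discourages fast switching.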