Distilling Reinforcement Learning Policies for Interpretable Robot Locomotion: Gradient Boosting Machines and Symbolic Regression

Recent advances in reinforcement learning (RL) have led to remarkable robot locomotion capabilities. However, the complexity and ``black-box'' nature of neural network-based RL policies hinder their interpretability and broader acceptance, particularly in applications that demand high levels of safety and reliability. This paper introduces a novel approach for distilling neural RL policies into more interpretable forms using Gradient Boosting Machines (GBMs), Explainable Boosting Machines (EBMs), and Symbolic Regression. By leveraging the inherent interpretability of generalized additive models, decision trees, and analytical expressions, we transform opaque neural network policies into more transparent ``glass-box'' models. We train expert neural network policies using RL and subsequently distill them into (i) GBMs, (ii) EBMs, and (iii) symbolic policies. To address the distribution shift inherent in behavioral cloning, we propose using the Dataset Aggregation (DAgger) algorithm with a curriculum that alternates actions between the expert and distilled policies depending on the episode, enabling efficient distillation of feedback control policies. We evaluate our approach on several robot locomotion gaits (walking, trotting, bounding, and pacing) and use several methods to study how important different observations are to the joint actions of the distilled policies. We train neural expert policies for 205 hours of simulated experience and, using the proposed method, distill interpretable policies with only 10 minutes of simulated interaction for each gait.
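
The DAgger-with-curriculum idea can be illustrated with a minimal sketch. This is not the paper's implementation: the one-dimensional system, the proportional-controller "expert", the one-parameter least-squares "student" (standing in for a GBM, EBM, or symbolic learner), and the linearly decaying mixing probability `beta` are all assumptions chosen to keep the example self-contained. All function names are hypothetical. What the sketch does preserve is the core of DAgger: every visited state is labeled with the expert's action, while the action actually executed alternates between expert and student according to an episode-dependent schedule.

```python
import random


def expert_policy(x):
    # Stand-in for the neural expert (assumption): a fixed proportional controller.
    return -2.0 * x


def fit_student(states, actions):
    # Stand-in for the interpretable learner (assumption):
    # one-parameter least squares for u = w * x.
    num = sum(s * a for s, a in zip(states, actions))
    den = sum(s * s for s in states)
    return num / den if den > 0 else 0.0


def dagger_distill(n_episodes=10, horizon=50, dt=0.05, seed=0):
    rng = random.Random(seed)
    states, labels = [], []
    w = 0.0  # untrained student gain
    for ep in range(n_episodes):
        # Assumed curriculum: the expert acts with probability beta,
        # which decays linearly from 1 to 0 over the episodes.
        beta = max(0.0, 1.0 - ep / max(1, n_episodes - 1))
        x = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            u_expert = expert_policy(x)
            u_student = w * x
            # Aggregation step: always store the expert's label for the visited state.
            states.append(x)
            labels.append(u_expert)
            # Execute either the expert's or the student's action.
            u = u_expert if rng.random() < beta else u_student
            # Toy dynamics (assumption): noisy single integrator.
            x = x + dt * u + rng.gauss(0.0, 0.01)
        # Refit the student on the full aggregated dataset after each episode.
        w = fit_student(states, labels)
    return w, len(states)


w, n = dagger_distill()
print(f"student gain: {w:.3f} from {n} samples")
```

Because later episodes are driven mostly by the student, the aggregated dataset covers the states the student itself reaches, which is how DAgger counters the distribution shift of plain behavioral cloning.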
