Reward shaping has been applied widely to accelerate the training of Reinforcement Learning (RL) agents. However, a principled way of designing effective reward shaping functions, especially for complex continuous control problems, remains largely under-explored. In this work, we propose to automatically learn a reward shaping function for continuous control problems from offline datasets, potentially contaminated by unobserved confounding variables. Specifically, our method builds upon the recently proposed causal Bellman equation to learn a tight upper bound on the optimal state values, which is then used as the potential in the Potential-Based Reward Shaping (PBRS) framework. Our proposed reward shaping algorithm is tested with Soft Actor-Critic (SAC) on multiple commonly used continuous control benchmarks and exhibits strong performance in the presence of unobserved confounders. More broadly, our work marks a solid first step towards confounding-robust continuous control from a causal perspective. Code for training our reward shaping functions can be found at https://github.com/mateojuliani/confounding_robust_cont_control.
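The PBRS framework referenced above defines the shaped reward as the environment reward plus a discounted potential difference, which leaves the optimal policy unchanged. A minimal sketch, assuming a generic potential function `phi` (the paper learns this potential as an upper bound on optimal state values via the causal Bellman equation, which is not reproduced here):

```python
# Minimal sketch of Potential-Based Reward Shaping (PBRS).
# `phi` is a hypothetical potential function over states; in the paper's
# method it would be the learned upper bound on optimal state values.
def shaped_reward(r, s, s_next, phi, gamma=0.99):
    """Return the PBRS-shaped reward: r + gamma * phi(s') - phi(s)."""
    return r + gamma * phi(s_next) - phi(s)

# Example: with an identity potential and no discounting, the shaping
# term reduces to the change in state value.
r_shaped = shaped_reward(1.0, 0.0, 1.0, phi=lambda s: s, gamma=1.0)
```

Because the shaping term telescopes along any trajectory, adding it to the reward does not alter which policy is optimal; only the learned potential determines how much the shaping accelerates training.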