To produce reinforcement learning (RL) policies that are human-interpretable and generalize better to novel scenarios, Trivedi et al. (2021) present a method (LEAPS) that first learns a program embedding space that continuously parameterizes diverse programs from a pre-generated program dataset, and then, given a task, searches the learned embedding space for a task-solving program. Despite the encouraging results, the program policies that LEAPS can produce are limited by the distribution of the program dataset. Furthermore, during search, LEAPS evaluates each candidate program solely by its return, so it cannot precisely reward the correct parts of a program or penalize the incorrect parts. To address these issues, we propose to learn a meta-policy that composes a series of programs sampled from the learned program embedding space. By learning to compose programs, our proposed hierarchical programmatic reinforcement learning (HPRL) framework can produce program policies that describe complex behaviors outside the distribution of the program dataset, and it can directly assign credit to the programs that induce desired behaviors. Experimental results in the Karel domain show that our proposed framework outperforms baselines. Ablation studies confirm the limitations of LEAPS and justify our design choices.
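The composition idea can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in chosen for illustration: `toy_decoder` replaces the learned program decoder with a three-entry lookup table, the latent code is an integer rather than a continuous embedding, `run_program` is a toy one-dimensional world rather than Karel, and `toy_meta_policy` is a hand-written rule rather than a learned meta-policy. It shows only the rollout structure, in which one program is selected per step and each program receives its own reward; it does not reproduce how the meta-policy or the embedding space is trained.

```python
from typing import Callable, List, Tuple

# A "program" here is just a list of primitive action names (an assumption
# made for this sketch; the real programs are written in a DSL).
Program = List[str]


def toy_decoder(z: int) -> Program:
    """Hypothetical decoder: maps a discrete latent code to a short program."""
    library = {0: ["move"], 1: ["turnLeft"], 2: ["move", "move"]}
    return library[z]


def run_program(state: int, program: Program) -> Tuple[int, float]:
    """Toy 1-D world: each 'move' advances one cell.

    Returns the new state and the reward earned by this program alone,
    defined here as the number of cells advanced.
    """
    start = state
    for action in program:
        if action == "move":
            state += 1
    return state, float(state - start)


def toy_meta_policy(state: int) -> int:
    """Hand-written rule standing in for a learned meta-policy."""
    return 2 if state < 4 else 1


def compose(
    meta_policy: Callable[[int], int],
    decoder: Callable[[int], Program],
    horizon: int,
    state: int = 0,
) -> Tuple[Program, List[float]]:
    """Select one latent program per step and concatenate the results.

    Rewards are recorded per program, so each selected program is credited
    separately instead of sharing a single episode-level return.
    """
    composed: Program = []
    rewards: List[float] = []
    for _ in range(horizon):
        z = meta_policy(state)
        program = decoder(z)
        state, reward = run_program(state, program)
        composed.extend(program)
        rewards.append(reward)
    return composed, rewards


composed, rewards = compose(toy_meta_policy, toy_decoder, horizon=3)
print(composed)  # ['move', 'move', 'move', 'move', 'turnLeft']
print(rewards)   # [2.0, 2.0, 0.0]
```

In this toy run the composed program has five actions, longer than any single entry the decoder can emit, and the per-program rewards show which of the three selected programs contributed to progress.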