Deep reinforcement learning (deep RL) excels in various domains but lacks generalizability and interpretability. In contrast, programmatic RL methods (Trivedi et al., 2021; Liu et al., 2023) reformulate RL tasks as synthesizing interpretable programs that can be executed in the environment. Despite encouraging results, these methods are limited to short-horizon tasks. Alternatively, state machine policies (Inala et al., 2020) can inductively generalize to long-horizon tasks, but they struggle to scale to diverse and complex behaviors. This work proposes the Program Machine Policy (POMP), which combines the advantages of programmatic RL and state machine policies, allowing it to represent complex behaviors and to address long-horizon tasks. Specifically, we introduce a method that retrieves a set of effective, diverse, and compatible programs. We then use these programs as the modes of a state machine and learn a transition function that switches among the mode programs, which allows the policy to capture repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and inductively generalizes to even longer horizons without any fine-tuning. Ablation studies confirm the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.
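To make the idea of "programs as modes of a state machine" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy one-dimensional environment (`Corridor`), two hand-written mode programs, and a hand-written transition rule (`toy_transition`); all of these names are hypothetical. In POMP the mode programs are retrieved by search and the transition function is learned, neither of which is shown here.

```python
from typing import Callable, List, Optional


class Corridor:
    """Hypothetical toy environment: an agent on a line of `length` cells."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.pos = 0
        self.visited = {0}

    def step(self, delta: int) -> None:
        # Move by `delta`, clamped to the corridor bounds.
        self.pos = max(0, min(self.length - 1, self.pos + delta))
        self.visited.add(self.pos)


# A mode program is any callable that runs to completion on the environment.
Program = Callable[[Corridor], None]
# The transition function maps (previous mode, environment) to the next mode
# index, or to None to terminate.
Transition = Callable[[Optional[int], Corridor], Optional[int]]


def walk_right_twice(env: Corridor) -> None:
    env.step(+1)
    env.step(+1)


def walk_left_once(env: Corridor) -> None:
    env.step(-1)


def toy_transition(prev_mode: Optional[int], env: Corridor) -> Optional[int]:
    """Hand-written stand-in for a learned transition function (assumption)."""
    if env.pos == env.length - 1:
        return None  # reached the right end: terminate
    # Alternate between the two modes, starting with mode 0.
    return 1 if prev_mode == 0 else 0


def run_state_machine(
    env: Corridor,
    modes: List[Program],
    transition: Transition,
    max_switches: int = 1000,
) -> List[int]:
    """Execute mode programs in the order chosen by `transition`."""
    trace: List[int] = []
    mode = transition(None, env)
    while mode is not None and len(trace) < max_switches:
        modes[mode](env)  # run the selected program to completion
        trace.append(mode)
        mode = transition(mode, env)
    return trace


modes = [walk_right_twice, walk_left_once]

short_env = Corridor(length=5)
short_trace = run_state_machine(short_env, modes, toy_transition)

# The same modes and transition rule, unchanged, on a 10x longer corridor.
long_env = Corridor(length=50)
long_trace = run_state_machine(long_env, modes, toy_transition)
```

Because the transition rule only repeats the same two modes until a termination condition holds, the identical policy finishes both the short and the long corridor. This is a toy illustration of how repeating mode programs can carry over to longer horizons, not evidence for the paper's results.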