Deep reinforcement learning excels in various domains but lacks generalizability and interpretability. Programmatic RL methods (Trivedi et al., 2021; Liu et al., 2023) reformulate solving RL tasks as synthesizing interpretable programs that can be executed in the environments. Despite encouraging results, these methods are limited to short-horizon tasks. On the other hand, representing RL policies as state machines (Inala et al., 2020) allows inductive generalization to long-horizon tasks; however, this approach struggles to scale up to acquire diverse and complex behaviors. This work proposes Program Machine Policies (POMPs), which combine the advantages of programmatic RL and state machine policies, allowing for the representation of complex behaviors while addressing long-horizon tasks. Specifically, we introduce a method that retrieves a set of effective, diverse, and compatible programs. We then use these programs as modes of a state machine and learn a transition function that switches among the mode programs, which captures long-horizon repetitive behaviors. Our proposed framework outperforms programmatic RL and deep RL baselines on various tasks and inductively generalizes to even longer horizons without any fine-tuning. Ablation studies justify the effectiveness of our proposed search algorithm for retrieving a set of programs as modes.
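To make the structure concrete, the following is a minimal sketch of a state machine whose modes are programs, under strong simplifying assumptions. It is not the implementation from this work: the `Corridor` environment, the two mode programs, and the hand-written `transition` rule are all hypothetical stand-ins. In particular, the transition function here is a fixed rule, whereas the abstract describes it as learned, and the mode programs are written by hand rather than retrieved by search.

```python
from typing import Callable, List, Optional


class Corridor:
    """Hypothetical toy environment: walk from cell 0 to cell `length`."""

    def __init__(self, length: int) -> None:
        self.length = length
        self.pos = 0

    def move(self) -> None:
        # Advance one cell, never past the goal.
        if self.pos < self.length:
            self.pos += 1

    def done(self) -> bool:
        return self.pos >= self.length


# A mode program runs to completion against the environment.
Program = Callable[[Corridor], None]


def walk_three(env: Corridor) -> None:
    """Mode 0 (hypothetical): take up to three steps forward."""
    for _ in range(3):
        if env.done():
            break
        env.move()


def walk_one(env: Corridor) -> None:
    """Mode 1 (hypothetical): take a single step forward."""
    if not env.done():
        env.move()


TERMINATE = -1


def transition(mode: Optional[int], env: Corridor) -> int:
    """Hand-written stand-in for a learned transition function.

    Picks the next mode from the current state; `mode` is the mode that
    just finished (None at the start) and is unused by this simple rule.
    """
    if env.done():
        return TERMINATE
    remaining = env.length - env.pos
    return 0 if remaining >= 3 else 1


def run_machine(
    modes: List[Program],
    transition_fn: Callable[[Optional[int], Corridor], int],
    env: Corridor,
    max_switches: int = 10_000,
) -> List[int]:
    """Execute mode programs until the transition function terminates.

    Returns the sequence of mode indices that were executed.
    """
    trace: List[int] = []
    mode = transition_fn(None, env)
    while mode != TERMINATE and len(trace) < max_switches:
        modes[mode](env)          # run the selected mode program to completion
        trace.append(mode)
        mode = transition_fn(mode, env)
    return trace


if __name__ == "__main__":
    modes = [walk_three, walk_one]
    short_env = Corridor(7)
    print(run_machine(modes, transition, short_env), short_env.pos)
    long_env = Corridor(70)
    print(len(run_machine(modes, transition, long_env)), long_env.pos)
```

The same machine handles both corridor lengths unchanged because the loop over modes repeats as often as the state requires. This is only a toy illustration of why a state-machine structure can extend to longer horizons, not evidence for the results reported above.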