Reinforcement learning (RL) has garnered significant attention for developing decision-making agents that maximize rewards, specified by an external supervisor, in fully observable environments. Many real-world problems, however, involve only partial observations and are formulated as partially observable Markov decision processes (POMDPs). Previous studies have tackled RL in POMDPs either by incorporating a memory of past actions and observations or by inferring the true state of the environment from observed data. Aggregating observed data over time, however, becomes impractical in continuous spaces. Moreover, inference-based RL approaches often require many samples to perform well, because they focus solely on reward maximization and neglect uncertainty in the inferred state. Active inference (AIF) is a framework formulated for POMDPs that directs agents to select actions by minimizing a function called expected free energy (EFE). This supplements reward-maximizing (exploitative) behaviour, as in RL, with information-seeking (exploratory) behaviour. Despite this exploratory behaviour, the use of AIF has been limited to discrete spaces because of the computational challenges associated with EFE. In this paper, we propose a unified principle that establishes a theoretical connection between AIF and RL, enabling the two approaches to be integrated and overcoming the limitations described above in continuous-space POMDP settings. We support this principle with theoretical analysis, which provides novel perspectives on using AIF to design artificial agents. Experimental results demonstrate the superior learning capabilities of our method on continuous-space partially observable tasks. Notably, our approach harnesses information-seeking exploration, which enables it to solve reward-free problems effectively and makes explicit task reward design by an external supervisor optional.