An agent learns to organize its decisions to achieve a behavioral goal, such as maximizing reward, and reinforcement learning is often used for this optimization. Learning an optimal behavioral strategy is difficult when the events needed for learning are only partially observable, a setting formalized as a partially observable Markov decision process (POMDP). Real-world environments, however, also present many events that are irrelevant to reward delivery and to the optimal behavioral strategy. Conventional POMDP methods, which attempt to infer transition rules among all observations, including irrelevant states, are ineffective in such environments. Assuming a redundantly observable Markov decision process (ROMDP), we propose a goal-oriented reinforcement learning method that efficiently learns the state transition rules among reward-related "core states" from redundant observations. Starting with a small number of initial core states, our model gradually adds new core states to the transition diagram until it achieves an optimal behavioral strategy consistent with the Bellman equation. We demonstrate that the resulting inference model outperforms the conventional method for POMDPs. We emphasize that our model, which contains only the core states, is highly explainable. Furthermore, the proposed method is suited to online learning because it reduces memory consumption and improves learning speed.
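The idea of growing a set of core states until the Bellman equation is satisfied can be illustrated with a minimal sketch. This is not the proposed method itself: it assumes a small, fully known, deterministic toy environment, lumps every non-core observation into a single background state, and promotes the observation with the largest Bellman residual. The tables `NEXT` and `REWARD` and the functions `build_core_model`, `value_iteration`, `bellman_residual`, and `grow_core_states` are hypothetical names introduced only for this illustration.

```python
import numpy as np

GAMMA = 0.9

# Hypothetical toy environment (an assumption, not from the paper):
# 6 observations, 2 actions, deterministic transitions.
# Observations 0 and 1 lead to reward, 2 is absorbing, 3-5 are distractors.
NEXT = np.array([[3, 1], [0, 2], [2, 2], [4, 5], [5, 3], [3, 4]])  # NEXT[o, a]
REWARD = np.zeros((6, 2))
REWARD[1, 1] = 1.0  # reward for taking action 1 in observation 1


def build_core_model(core):
    """Aggregate every non-core observation into one background state."""
    n_obs, n_act = NEXT.shape
    bg = len(core)                        # index of the background state
    phi = np.full(n_obs, bg)              # observation -> model state
    phi[core] = np.arange(len(core))
    counts = np.bincount(phi, minlength=bg + 1)
    P = np.zeros((n_act, bg + 1, bg + 1))
    R = np.zeros((n_act, bg + 1))
    for o in range(n_obs):
        for a in range(n_act):
            # Average transitions and rewards over observations sharing a state.
            P[a, phi[o], phi[NEXT[o, a]]] += 1.0 / counts[phi[o]]
            R[a, phi[o]] += REWARD[o, a] / counts[phi[o]]
    if counts[bg] == 0:                   # keep P stochastic if nothing is left over
        P[:, bg, bg] = 1.0
    return phi, P, R


def value_iteration(P, R, tol=1e-10):
    """Solve the reduced model; P has shape (A, S, S), R has shape (A, S)."""
    V = np.zeros(P.shape[1])
    while True:
        V_new = (R + GAMMA * P @ V).max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new


def bellman_residual(V_obs):
    """|V(o) - max_a [r(o, a) + gamma * V(o')]| for every observation."""
    target = (REWARD + GAMMA * V_obs[NEXT]).max(axis=1)
    return np.abs(V_obs - target)


def grow_core_states(initial_core, tol=1e-6):
    """Add core states until the lifted values satisfy the Bellman equation."""
    core = list(initial_core)
    while True:
        phi, P, R = build_core_model(core)
        V_obs = value_iteration(P, R)[phi]   # lift model values to observations
        residual = bellman_residual(V_obs)
        if residual.max() < tol:
            return core, V_obs
        residual[core] = -np.inf             # only promote non-core observations
        core.append(int(residual.argmax()))
```

In this toy setting, calling `grow_core_states([1])` starts from the single reward-yielding observation and stops once observation 0 has been added. The absorbing and distractor observations stay lumped in the background state, so the final model is smaller than one built over all six observations.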