We study an infinite-horizon average-reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. Delay and compositeness mean that the reward generated by taking an action at a given state is split into several components that are realized sequentially at later time instances. Partial anonymity means that, for each state, the learner observes only the aggregate of the past reward components that were generated by (possibly different) actions taken at that state and are realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy in this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$, where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. The bound is therefore order-optimal in $T$, and the delay contributes only an additive term.
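To make the feedback model concrete, the following is a minimal sketch of how per-state aggregates could arise from delayed, composite rewards. It is an illustration only and not part of $\mathrm{DUCRL2}$: the helper names `observe_aggregates` and `split_reward`, the fragment values, and the toy trajectory are all assumptions introduced here.

```python
def observe_aggregates(trajectory, split_reward, num_states, horizon):
    """Return obs[t][s]: the aggregate reward revealed for state s at time t.

    trajectory   : list of (state, action) pairs, one per time step
    split_reward : hypothetical rule mapping (state, action) to a list of
                   fragments realized at delays 0, 1, 2, ...
    """
    obs = [[0.0] * num_states for _ in range(horizon)]
    for t, (state, action) in enumerate(trajectory):
        for delay, fragment in enumerate(split_reward(state, action)):
            # Fragments falling beyond the horizon are never observed.
            if t + delay < horizon:
                # Only the per-state sum is stored; the action that
                # produced each fragment is not recorded (anonymity).
                obs[t + delay][state] += fragment
    return obs


def split_reward(state, action):
    # Assumed toy rule: action 0 pays 0.5 now and 0.5 one step later;
    # any other action pays 0.25 now and 0.25 two steps later.
    if action == 0:
        return [0.5, 0.5]
    return [0.25, 0.0, 0.25]


trajectory = [(0, 0), (0, 1), (1, 0), (0, 0)]
obs = observe_aggregates(trajectory, split_reward, num_states=2, horizon=4)
for t, row in enumerate(obs):
    print(t, row)
```

In this toy run, the value observed for state 0 at time 1 mixes the delayed fragment of action 0 (taken at time 0) with the immediate fragment of action 1 (taken at time 1), and the learner cannot tell the two apart.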