We investigate an infinite-horizon average-reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. Delay and compositeness mean that the reward generated by taking an action in a given state is split into several components, which are realized sequentially at later time instances. Partial anonymity means that, for each state, the learner observes only the aggregate of the past reward components that were generated by the (possibly different) actions taken at that state and that are realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$, where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. The bound is therefore optimal in its order of $T$, and the delay has only an additive impact.
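To make the feedback model concrete, the following is a minimal Python sketch of how delayed, composite, and partially anonymous observations could be generated from a trajectory. It is not the $\mathrm{DUCRL2}$ algorithm, and it rests on illustrative assumptions that do not come from the abstract: the function name `simulate_feedback`, a deterministic total reward per state-action pair, a random split of each reward into `max_delay + 1` components, and the loss of any component that would land beyond the end of the trajectory.

```python
import random


def simulate_feedback(trajectory, mean_reward, max_delay, num_states, seed=0):
    """Return observed[t][s]: the aggregate reward the learner sees at time t
    for state s. All names and modelling choices here are hypothetical.

    trajectory  : list of (state, action) pairs, one per time step
    mean_reward : mean_reward[s][a] is the total reward generated by (s, a)
    max_delay   : components of a reward are spread over t, ..., t + max_delay
    """
    rng = random.Random(seed)
    horizon = len(trajectory)
    observed = [[0.0] * num_states for _ in range(horizon)]

    for t, (s, a) in enumerate(trajectory):
        total = mean_reward[s][a]
        # Compositeness: split the reward into max_delay + 1 positive parts
        # (assumed random split; weights are normalised to sum to one).
        weights = [rng.uniform(0.1, 1.0) for _ in range(max_delay + 1)]
        norm = sum(weights)
        for tau, w in enumerate(weights):
            # Delay: the tau-th component is realised at time t + tau.
            if t + tau < horizon:
                # Partial anonymity: the component is credited to the state s
                # only; the action a and the generation time t are not recorded.
                observed[t + tau][s] += total * w / norm
    return observed


# Small usage example with two states and two actions.
rewards = [[0.2, 0.8], [0.5, 0.1]]
path = [(0, 0), (0, 1), (1, 0), (0, 1)]
for step, row in enumerate(simulate_feedback(path, rewards, 2, 2)):
    print(step, [round(x, 3) for x in row])
```

Because each observation is indexed by state alone, the learner in this sketch cannot tell which action, or which earlier visit, produced a given contribution. This is the difficulty that the abstract's partial anonymity describes.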