We model a Markov decision process parametrized by an unknown parameter and study the asymptotic behavior of a sampling-based algorithm called Thompson sampling. The standard definition of regret is not always suitable for evaluating a policy, especially when the underlying chain structure is general. We show that the standard (expected) regret can grow (super-)linearly and fails to capture the notion of learning in realistic settings with non-trivial state evolution. By decomposing the standard (expected) regret, we develop a new metric, called the expected residual regret, which disregards the immutable consequences of past actions and instead measures regret against the optimal reward moving forward from the current period. We show that the expected residual regret of the Thompson sampling algorithm is upper bounded by a term that converges exponentially fast to 0. We present conditions under which the posterior sampling error of Thompson sampling converges to 0 almost surely. We then introduce the probabilistic version of the expected residual regret and present conditions under which it converges to 0 almost surely. Thus, we provide a viable concept of learning for sampling algorithms that should prove useful in broader settings than those previously considered.
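To make the setting concrete, the following is a minimal sketch of Thompson sampling on a parametrized Markov decision process. It is not the construction analyzed in the paper: the two-state, two-action chain in `make_mdp`, the finite candidate set `thetas`, the discount factor, and the use of value iteration in `greedy_policy` are all assumptions chosen for illustration. Each period, the algorithm samples a parameter from the current posterior, acts optimally for the sampled parameter, observes the transition, and updates the posterior by Bayes' rule.

```python
import numpy as np


def make_mdp(theta):
    """Hypothetical 2-state, 2-action MDP (an illustrative assumption).

    Action 0 reaches state 1 with probability 0.5 regardless of theta.
    Action 1 reaches state 1 with probability theta (the unknown parameter).
    Being in state 1 pays reward 1; being in state 0 pays reward 0.
    """
    P = np.zeros((2, 2, 2))  # P[s, a, s'] = transition probability
    for s in range(2):
        P[s, 0] = [0.5, 0.5]
        P[s, 1] = [1.0 - theta, theta]
    R = np.array([[0.0, 0.0], [1.0, 1.0]])  # R[s, a] = reward
    return P, R


def greedy_policy(P, R, gamma=0.9, iters=200):
    """Optimal action per state via discounted value iteration (assumed criterion)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # Q[s, a]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def thompson_sampling(thetas, true_theta, horizon=500, seed=0):
    """Run Thompson sampling over a finite candidate set; return the final posterior."""
    rng = np.random.default_rng(seed)
    models = [make_mdp(t) for t in thetas]
    policies = [greedy_policy(P, R) for P, R in models]
    P_true, _ = make_mdp(true_theta)

    posterior = np.full(len(thetas), 1.0 / len(thetas))  # uniform prior
    state = 0
    for _ in range(horizon):
        k = rng.choice(len(thetas), p=posterior)   # sample a parameter
        action = policies[k][state]                # act as if it were true
        next_state = rng.choice(2, p=P_true[state, action])
        # Bayes update with the likelihood of the observed transition.
        likelihood = np.array([P[state, action, next_state] for P, _ in models])
        posterior = posterior * likelihood
        posterior /= posterior.sum()
        state = next_state
    return posterior


if __name__ == "__main__":
    print(thompson_sampling(thetas=[0.2, 0.8], true_theta=0.8))
```

In this toy chain, action 0 reveals nothing about the parameter, so the posterior changes only in periods where the sampled parameter recommends action 1. Tracking the posterior mass on `true_theta` over time gives a rough picture of the posterior sampling error discussed above, under the stated assumptions.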