Thompson sampling (TS) is widely used in sequential decision making due to its ease of use and appealing empirical performance. However, many existing analytical and empirical results for TS rely on restrictive assumptions on reward distributions, such as belonging to conjugate families, which limits their applicability in realistic scenarios. Moreover, sequential decision making is often carried out in batches, either because the problem is inherently batched or to reduce communication and computation costs. In this work, we jointly study these problems in two popular settings, namely stochastic multi-armed bandits (MABs) and infinite-horizon reinforcement learning (RL), where TS is used to learn the unknown reward distributions and transition dynamics, respectively. We propose batched $\textit{Langevin Thompson Sampling}$ algorithms that leverage MCMC methods to sample from approximate posteriors with only logarithmic communication costs in terms of batches. Our algorithms are computationally efficient and retain the order-optimal regret guarantees of $\mathcal{O}(\log T)$ for stochastic MABs and $\mathcal{O}(\sqrt{T})$ for RL. We complement our theoretical findings with experimental results.
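To make the idea of combining Langevin-based posterior sampling with batching concrete, the following is a minimal illustrative sketch for the bandit setting. It is not the algorithm proposed in this work. It assumes unit-variance Gaussian rewards, a $\mathcal{N}(0, 1)$ prior on each arm mean, an unadjusted Langevin update with a step size scaled by the posterior precision, and a doubling rule under which the sampler sees an arm's new data only once that arm's pull count has doubled. The names `langevin_sample` and `batched_langevin_ts`, and all parameter values, are hypothetical.

```python
import math
import random


def langevin_sample(n, reward_sum, steps=50, rng=random):
    """Draw an approximate posterior sample of one arm's mean.

    Assumption (not from the source): unit-variance Gaussian rewards and a
    N(0, 1) prior, so the log-posterior gradient is reward_sum - (n + 1) * theta.
    """
    precision = n + 1.0
    step = 0.5 / precision          # step size scaled to the posterior curvature
    theta = reward_sum / precision  # warm start at the posterior mode
    for _ in range(steps):
        grad = reward_sum - precision * theta
        # Unadjusted Langevin update: gradient step plus Gaussian noise.
        theta += step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return theta


def batched_langevin_ts(true_means, horizon, seed=0):
    """Run the sketch and return (cumulative regret, pull counts, batch count)."""
    rng = random.Random(seed)
    k = len(true_means)
    pulls = [0] * k        # live pull counts
    sums = [0.0] * k       # live reward sums
    batch_n = [0] * k      # statistics visible to the sampler,
    batch_sum = [0.0] * k  # frozen between batch updates
    batches = 0
    regret = 0.0
    best = max(true_means)
    for _ in range(horizon):
        samples = [langevin_sample(batch_n[a], batch_sum[a], rng=rng)
                   for a in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.gauss(true_means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - true_means[arm]
        # Assumed doubling rule: refresh the sampler's view of an arm only
        # when its pull count has doubled since the last refresh.
        if pulls[arm] >= max(1, 2 * batch_n[arm]):
            batch_n[arm] = pulls[arm]
            batch_sum[arm] = sums[arm]
            batches += 1
    return regret, pulls, batches
```

Under this doubling rule each arm triggers a refresh at pull counts 1, 2, 4, and so on, so the total number of batch updates grows at most logarithmically in the horizon for a fixed number of arms. Calling `batched_langevin_ts([0.1, 0.5, 0.9], horizon=2000)` returns the cumulative regret, per-arm pull counts, and the number of batch updates for one simulated run.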