We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited to a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time step, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
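The resampling rule described above can be illustrated with a minimal tabular sketch. This is not the paper's implementation. It assumes a finite MDP with known mean rewards `R`, an independent Dirichlet(1, ..., 1) prior over each transition row, and value iteration as the planner. All function names (`solve_discounted`, `sample_model`, `continuing_psrl`) are hypothetical, and `gamma` is passed in directly rather than derived from the horizon $T$.

```python
import numpy as np

def solve_discounted(P, R, gamma, iters=500):
    """Value iteration; returns a greedy policy for the gamma-discounted MDP.
    P: (S, A, S) transition probabilities, R: (S, A) mean rewards."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # (S, A) action values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def sample_model(counts, rng):
    """Draw one transition model from independent Dirichlet posteriors
    (assumed prior; counts holds prior pseudo-counts plus observations)."""
    S, A, _ = counts.shape
    P = np.empty_like(counts, dtype=float)
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

def continuing_psrl(true_P, R, gamma, T, seed=0):
    """Run T steps with no episodes or resets (continuing interface)."""
    rng = np.random.default_rng(seed)
    S, A, _ = true_P.shape
    counts = np.ones((S, A, S))  # Dirichlet(1, ..., 1) prior, an assumption
    policy = solve_discounted(sample_model(counts, rng), R, gamma)
    s, total_reward, resamples = 0, 0.0, 0
    for _ in range(T):
        # With probability 1 - gamma, replace the model by a posterior sample
        # and re-plan; otherwise keep following the current policy.
        if rng.random() < 1.0 - gamma:
            policy = solve_discounted(sample_model(counts, rng), R, gamma)
            resamples += 1
        a = policy[s]
        s_next = rng.choice(S, p=true_P[s, a])
        total_reward += R[s, a]
        counts[s, a, s_next] += 1  # posterior update from the observed transition
        s = s_next
    return total_reward, resamples, counts
```

Under these assumptions the agent re-plans about $(1-\gamma)T$ times in expectation, so a discount factor closer to 1 means fewer model replacements and longer commitment to each sampled model.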