We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time step, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze this resampling approach to randomized exploration.
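The following is a minimal sketch of the resampling scheme described above, assuming a tabular continuing MDP with a Dirichlet prior over transition probabilities and, as a simplification, posterior-mean reward estimates rather than a full reward posterior. The `env` object with `reset`/`step` methods, and the function names `continuing_psrl`, `sample_transitions`, and `discounted_greedy_policy`, are illustrative assumptions, not part of the paper's formal specification.

```python
import numpy as np

def sample_transitions(dirichlet_counts, rng):
    """Draw transition probabilities from the Dirichlet posterior, per (s, a)."""
    S, A, _ = dirichlet_counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(dirichlet_counts[s, a])
    return P

def discounted_greedy_policy(P, R, gamma, tol=1e-8):
    """Value iteration: gamma-discounted optimal policy in the sampled model."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)    # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1)
        V = V_new

def continuing_psrl(env, S, A, gamma, T, seed=0):
    """Continuing PSRL sketch: act on a posterior-sampled model; at each step,
    with probability 1 - gamma, replace the sample by a fresh posterior draw."""
    rng = np.random.default_rng(seed)
    trans_counts = np.ones((S, A, S))        # Dirichlet(1, ..., 1) prior over transitions
    reward_sum = np.zeros((S, A))
    reward_count = np.ones((S, A))
    s = env.reset()                          # hypothetical continuing-MDP interface
    policy = None
    for _ in range(T):
        if policy is None or rng.random() < 1.0 - gamma:
            P_hat = sample_transitions(trans_counts, rng)
            R_hat = reward_sum / reward_count    # posterior-mean rewards (simplification)
            policy = discounted_greedy_policy(P_hat, R_hat, gamma)
        a = policy[s]
        s_next, r = env.step(a)
        trans_counts[s, a, s_next] += 1
        reward_sum[s, a] += r
        reward_count[s, a] += 1
        s = s_next
```

Because resampling is triggered independently at each step with probability $1-\gamma$, a sampled model is retained for a geometrically distributed number of steps with mean $1/(1-\gamma)$, which matches the effective horizon of the $\gamma$-discounted planning objective.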