We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
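The abstract describes the algorithmic loop of continuing PSRL: act according to a policy optimized for a sampled model, and at each time step resample the model from the posterior with probability $1-\gamma$. The following is a minimal illustrative sketch of that loop for a tabular setting. The `env` interface (`reset()`, `step(action)`), the Dirichlet posterior over transitions, and the Gaussian-mean reward posterior are all assumptions made for the sake of a self-contained example; the paper's analysis does not prescribe these particular modeling choices.

```python
import numpy as np

def solve_discounted(P, R, gamma, iters=500):
    """Value iteration for a gamma-discounted tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns a deterministic policy as an array of actions, one per state.
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V            # (S, A) action values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def continuing_psrl(env, S, A, gamma, T, seed=0):
    """Sketch of continuing PSRL: follow a policy that is greedy in a
    sampled model, and at each step, with probability 1 - gamma, replace
    the model by a fresh posterior sample and replan.

    Assumes a hypothetical `env` exposing reset() -> state and
    step(action) -> (next_state, reward).
    """
    rng = np.random.default_rng(seed)
    dirichlet = np.ones((S, A, S))       # Dirichlet counts over transitions
    r_sum = np.zeros((S, A))             # running reward statistics
    r_count = np.ones((S, A))            # pseudo-count avoids division by zero

    def sample_model():
        # Draw a statistically plausible environment from the posterior.
        P = np.array([[rng.dirichlet(dirichlet[s, a]) for a in range(A)]
                      for s in range(S)])
        R = r_sum / r_count + rng.normal(scale=1.0 / np.sqrt(r_count))
        return P, R

    P, R = sample_model()
    policy = solve_discounted(P, R, gamma)
    s = env.reset()
    for _ in range(T):
        a = policy[s]
        s_next, r = env.step(a)
        dirichlet[s, a, s_next] += 1.0   # posterior update from observed data
        r_sum[s, a] += r
        r_count[s, a] += 1.0
        s = s_next
        if rng.random() < 1.0 - gamma:   # resample with probability 1 - gamma
            P, R = sample_model()
            policy = solve_discounted(P, R, gamma)
    return policy
```

Setting $\gamma$ as a function of the horizon $T$, as in the regret bound above, controls how often the agent commits to a new posterior sample: in expectation, each sampled model is followed for about $1/(1-\gamma)$ steps.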