Posterior sampling allows prior knowledge about the environment's transition dynamics to be exploited to improve the sample efficiency of reinforcement learning. The prior is typically specified as a class of parametric distributions, which can be cumbersome to design in practice and often leads to the choice of uninformative priors. In this work, we propose a novel posterior sampling approach in which the prior is given as a (partial) causal graph over the environment's variables. Such a graph is often more natural to specify, for example by listing known causal dependencies between biometric features in a medical treatment study. Specifically, we propose a hierarchical Bayesian procedure, called C-PSRL, that simultaneously learns the full causal graph at the higher level and the parameters of the resulting factored dynamics at the lower level. We provide an analysis of the Bayesian regret of C-PSRL that explicitly connects the regret rate with the degree of prior knowledge. Our numerical evaluation, conducted in illustrative domains, confirms that C-PSRL is substantially more efficient than posterior sampling with an uninformative prior while performing close to posterior sampling with the full causal graph.
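The two-level idea can be illustrated with a minimal sketch. This is not the authors' implementation of C-PSRL: it assumes binary state variables, a uniform hyper-prior over candidate parent sets, Beta(1, 1) priors on the transition parameters, and a sparsity bound `max_parents`. The names `candidate_parent_sets` and `FactorPosterior` are hypothetical and introduced only for illustration. For a single next-state variable, the higher level samples a parent set that is consistent with the known (partial) causal edges, and the lower level samples the parameters of the resulting factor.

```python
import itertools
import math
import random


def candidate_parent_sets(known_parents, n_vars, max_parents):
    """Parent sets that contain the known parents and have at most max_parents elements."""
    others = [i for i in range(n_vars) if i not in known_parents]
    extra = max_parents - len(known_parents)
    sets = []
    for k in range(extra + 1):
        for combo in itertools.combinations(others, k):
            sets.append(tuple(sorted(set(known_parents) | set(combo))))
    return sets


class FactorPosterior:
    """Hierarchical posterior for one binary next-state variable (illustrative assumption)."""

    def __init__(self, known_parents, n_vars, max_parents):
        self.candidates = candidate_parent_sets(known_parents, n_vars, max_parents)
        # counts[c][parent_values] = [times next value was 0, times it was 1]
        self.counts = [dict() for _ in self.candidates]

    def update(self, state, next_value):
        """Record one observed transition under every candidate parent set."""
        for c, parents in enumerate(self.candidates):
            key = tuple(state[p] for p in parents)
            self.counts[c].setdefault(key, [0, 0])[next_value] += 1

    def log_marginal(self, c):
        """Log marginal likelihood of the data under candidate c (Beta(1, 1) priors)."""
        total = 0.0
        for n0, n1 in self.counts[c].values():
            total += math.lgamma(1 + n0) + math.lgamma(1 + n1) - math.lgamma(2 + n0 + n1)
        return total

    def sample(self, rng):
        """Sample a parent set (higher level), then its parameters (lower level)."""
        logs = [self.log_marginal(c) for c in range(len(self.candidates))]
        top = max(logs)
        weights = [math.exp(value - top) for value in logs]  # uniform hyper-prior
        c = rng.choices(range(len(self.candidates)), weights=weights)[0]
        parents = self.candidates[c]
        params = {}
        for values in itertools.product((0, 1), repeat=len(parents)):
            n0, n1 = self.counts[c].get(values, (0, 0))
            # Sampled probability that the next value is 1 given these parent values
            params[values] = rng.betavariate(1 + n1, 1 + n0)
        return parents, params


def demo(seed=0, n_steps=200):
    """Toy data: the next value of the variable copies state variable 2."""
    rng = random.Random(seed)
    posterior = FactorPosterior(known_parents=(), n_vars=3, max_parents=1)
    for _ in range(n_steps):
        state = [rng.randint(0, 1) for _ in range(3)]
        posterior.update(state, state[2])
    return posterior.sample(rng)
```

In this toy setting, calling `demo()` returns a sampled parent set together with sampled parameters; with enough data the higher-level posterior concentrates on the true parent. Supplying a non-empty `known_parents` shrinks the candidate set, which is one simple way to see how a more informative partial graph reduces what must be learned.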