Posterior sampling allows the exploitation of prior knowledge of the environment's transition dynamics to improve the sample efficiency of reinforcement learning. The prior is typically specified as a class of parametric distributions, a task that can be cumbersome in practice and often results in the choice of uninformative priors. In this work, we propose a novel posterior sampling approach in which the prior is given as a (partial) causal graph over the environment's variables. Such a graph is often more natural to design, for instance by listing known causal dependencies between biometric features in a medical treatment study. Specifically, we propose a hierarchical Bayesian procedure, called C-PSRL, that simultaneously learns the full causal graph at the higher level and the parameters of the resulting factored dynamics at the lower level. For this procedure, we provide an analysis of its Bayesian regret, which explicitly connects the regret rate with the degree of prior knowledge. Our numerical evaluation, conducted in illustrative domains, confirms that C-PSRL is substantially more efficient than posterior sampling with an uninformative prior while performing close to posterior sampling with the full causal graph.
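To make the two-level idea concrete, the following is a minimal sketch of a hierarchical sampling step, not the C-PSRL implementation itself. It assumes binary state variables, ignores actions, places a uniform hyper-prior over every parent set that extends the known (partial) parents up to a size limit, and uses Dirichlet-multinomial conjugacy at the lower level. All names (`candidate_parent_sets`, `sample_factor`, `max_parents`, `alpha`) are hypothetical and chosen only for illustration.

```python
import itertools
from math import lgamma

import numpy as np


def candidate_parent_sets(known, n_vars, max_parents):
    """All supersets of `known` with at most `max_parents` variables.

    Assumes max_parents >= len(known).
    """
    rest = [v for v in range(n_vars) if v not in known]
    out = []
    for k in range(max_parents - len(known) + 1):
        for extra in itertools.combinations(rest, k):
            out.append(tuple(sorted(set(known) | set(extra))))
    return out


def count_table(data, child, parents):
    """Count next-step values of `child` for each joint value of `parents`."""
    counts = np.zeros((2 ** len(parents), 2))
    for state, next_state in data:
        idx = 0
        for p in parents:  # encode the parents' binary values as a row index
            idx = idx * 2 + state[p]
        counts[idx, next_state[child]] += 1
    return counts


def log_marginal(counts, alpha=1.0):
    """Dirichlet-multinomial log marginal likelihood of a count table."""
    total = 0.0
    for row in counts:
        k = len(row)
        total += lgamma(k * alpha) - lgamma(k * alpha + row.sum())
        total += sum(lgamma(alpha + c) - lgamma(alpha) for c in row)
    return total


def sample_factor(data, child, known, n_vars, max_parents, rng, alpha=1.0):
    """Sample a parent set (higher level), then its parameters (lower level)."""
    cands = candidate_parent_sets(known, n_vars, max_parents)
    tables = [count_table(data, child, ps) for ps in cands]
    # Uniform hyper-prior: posterior weights are proportional to the
    # marginal likelihood of the data under each candidate parent set.
    logw = np.array([log_marginal(t, alpha) for t in tables])
    w = np.exp(logw - logw.max())
    w /= w.sum()
    i = rng.choice(len(cands), p=w)
    # One Dirichlet posterior draw per joint parent configuration.
    probs = np.array([rng.dirichlet(alpha + row) for row in tables[i]])
    return cands[i], probs
```

In this sketch, a richer partial graph (a larger `known`) shrinks the set of candidates the higher level has to distinguish between, which is the intuition behind linking the amount of prior knowledge to learning efficiency. A full posterior sampling loop would additionally plan in the sampled model and collect new data each episode; that part is omitted here.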