In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework that captures these correlations through structured and informative priors. Within this framework, we propose sDM, a generic Bayesian approach for OPE and OPL with both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, departing from conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations, and empirical evidence showcases the strong performance of sDM.
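To make the average-case evaluation concrete, one plausible Bayesian OPE metric averages the estimator's squared error over problem instances drawn from the prior rather than taking a worst case. The symbols below ($Q$ for the prior over instances $\theta_*$, $S$ for the logged data, $\hat{V}$ for the estimator, and $V(\pi;\theta_*)$ for the true value of the target policy $\pi$) are illustrative notation, not necessarily the paper's exact definitions:
\[
  \mathrm{BMSE}(\hat{V}) \;=\; \mathbb{E}_{\theta_* \sim Q}\,
  \mathbb{E}_{S \mid \theta_*}\!\Big[\big(\hat{V}(\pi; S) - V(\pi; \theta_*)\big)^2\Big].
\]
An analogous average-case criterion for OPL would take the expectation of the learned policy's suboptimality gap over $\theta_* \sim Q$, again contrasting with worst-case guarantees over a fixed instance.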