In interactive systems, actions are often correlated, which presents an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework that captures these correlations through structured and informative priors. Within this framework, we propose sDM, a generic Bayesian approach to OPE and OPL with both algorithmic and theoretical grounding. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, in contrast to conventional worst-case assessments. We analyze sDM in both OPE and OPL, highlighting the benefits of leveraging action correlations. Experiments demonstrate the strong performance of sDM.