We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces, where existing methods (most of which rely crucially on reward-regression models or importance-weighted policy gradients) fail due to excessive bias or variance. To overcome these issues, we propose a novel two-stage algorithm called Policy Optimization via Two-Stage Policy Decomposition (POTEC). It leverages clustering in the action space and learns two different policies, one via a policy-based approach and the other via a regression-based approach. In particular, we derive a novel low-variance gradient estimator that enables efficient learning of a first-stage policy for cluster selection via a policy-based approach. To select a specific action within the cluster sampled by the first-stage policy, POTEC uses a second-stage policy derived from a regression-based approach within each cluster. We show that a local correctness condition, which requires only that the regression model preserve the relative expected reward differences of the actions within each cluster, ensures that our policy-gradient estimator is unbiased and that the second-stage policy is optimal. We also show that POTEC strictly generalizes policy- and regression-based approaches and their associated assumptions. Comprehensive experiments demonstrate that POTEC provides substantial improvements in OPL effectiveness, particularly in large and structured action spaces.
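The two-stage decomposition described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_stage_select`, the toy clustering, the reward estimates, and the greedy (argmax) form of the second stage are all assumptions made for illustration. It samples a cluster from a first-stage distribution, then picks the action with the highest estimated reward inside that cluster.

```python
import numpy as np


def two_stage_select(cluster_probs, q_hat, action_cluster, rng):
    """Sample a cluster, then choose the best-scoring action inside it.

    cluster_probs  : first-stage policy over clusters for one context (assumed given)
    q_hat          : regression-model reward estimates, one per action (assumed given)
    action_cluster : cluster index of each action (hypothetical clustering)
    """
    # First stage: stochastic cluster selection.
    cluster = rng.choice(len(cluster_probs), p=cluster_probs)
    # Second stage: greedy choice among the actions in the sampled cluster.
    members = np.flatnonzero(action_cluster == cluster)
    best = members[np.argmax(q_hat[members])]
    return int(cluster), int(best)


# Toy data, purely illustrative.
action_cluster = np.array([0, 0, 1, 1, 1])
q_hat = np.array([0.2, 0.5, 0.1, 0.9, 0.3])
cluster_probs = np.array([0.3, 0.7])

cluster, action = two_stage_select(
    cluster_probs, q_hat, action_cluster, np.random.default_rng(0)
)

# Adding a different constant offset to each cluster's estimates keeps the
# within-cluster differences intact, so the second-stage choice is unchanged.
shifted = q_hat + np.array([10.0, -4.0])[action_cluster]
same = two_stage_select(
    cluster_probs, shifted, action_cluster, np.random.default_rng(0)
)
```

In this sketch the second stage depends only on how the estimates rank actions within a cluster, which is why a per-cluster offset in `q_hat` leaves the selected action unchanged. This mirrors the spirit of the local correctness condition, although the condition itself is stated in terms of expected reward differences.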