This paper studies offline policy learning, which aims to utilize observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn an optimal individualized decision rule that achieves the best overall outcomes for a given population. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be uniformly bounded away from zero. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities for certain actions. In this paper, we propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed using knowledge of the behavior policies that collected the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class we optimize over. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are bounded away from zero over time, while those for suboptimal actions are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized concentration inequality for inverse-propensity-weighting estimators, generalizing the well-known empirical Bernstein inequality to unbounded and non-i.i.d. data. We complement our theory with an efficient optimization algorithm based on Majorization-Minimization and policy tree search, as well as extensive simulation studies and real-world applications that demonstrate the efficacy of PPL.
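As a schematic illustration (the notation here is ours for exposition, not taken verbatim from the paper): writing $e_t(a \mid x)$ for the known behavior-policy propensity at round $t$ and $\Pi$ for the policy class, PPL selects
\[
\hat{\pi} \in \arg\max_{\pi \in \Pi} \left\{ \underbrace{\frac{1}{T} \sum_{t=1}^{T} \frac{\mathbf{1}\{A_t = \pi(X_t)\}}{e_t(A_t \mid X_t)}\, Y_t}_{\text{IPW estimate } \widehat{V}(\pi)} \;-\; R(\pi; \delta) \right\},
\]
where $R(\pi; \delta)$ is a data-dependent uncertainty penalty (left abstract here) chosen so that the objective is a valid lower confidence bound for the policy value $V(\pi)$ with probability at least $1 - \delta$. Pessimism downweights policies whose value estimates rely on poorly explored actions, which is what removes the need for a uniform overlap condition.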
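For reference, the classical empirical Bernstein inequality that this analysis generalizes (in the form of Maurer and Pontil, 2009) states that for i.i.d. $Z_1, \dots, Z_n \in [0, 1]$ with mean $\mu$ and sample variance $\hat{V}_n$, with probability at least $1 - \delta$,
\[
\mu - \frac{1}{n} \sum_{i=1}^{n} Z_i \le \sqrt{\frac{2 \hat{V}_n \log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)}.
\]
The inverse-propensity-weighted terms above are neither bounded (propensities may be arbitrarily small) nor i.i.d. (behavior policies may adapt to past data), which is why a self-normalized extension is needed.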