Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

This paper studies offline policy learning, which aims at utilizing observations collected a priori (from either fixed or adaptively evolving behavior policies) to learn the optimal individualized decision rule in a given class. Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset. In other words, the performance of these methods depends on the worst-case propensity in the offline dataset. As one has no control over the data collection process, this assumption can be unrealistic in many situations, especially when the behavior policies are allowed to evolve over time with diminishing propensities. In this paper, we propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values. The LCBs are constructed by quantifying the estimation uncertainty of the augmented inverse propensity weighted (AIPW)-type estimators using knowledge of the behavior policies for collecting the offline data. Without assuming any uniform overlap condition, we establish a data-dependent upper bound for the suboptimality of our algorithm, which depends only on (i) the overlap for the optimal policy, and (ii) the complexity of the policy class. As an implication, for adaptively collected data, we ensure efficient policy learning as long as the propensities for optimal actions are lower bounded over time, while those for suboptimal ones are allowed to diminish arbitrarily fast. In our theoretical analysis, we develop a new self-normalized concentration inequality for IPW estimators, generalizing the well-known empirical Bernstein's inequality to unbounded and non-i.i.d. data.
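
The idea of optimizing lower confidence bounds rather than point estimates can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses plain IPW terms instead of AIPW-type estimators, a small finite policy class, and a simplified uncertainty term (the root of the summed squared inverse propensities over matched samples, divided by T) chosen here only for illustration. The function names, the tuning constant `beta`, and the toy data are all assumptions.

```python
import numpy as np

def ipw_values(actions, rewards, propensities, policy_actions):
    """IPW point estimate of each policy's value.

    actions, rewards, propensities: shape (T,), logged data, where
    propensities[t] is the known behavior probability of actions[t].
    policy_actions: shape (K, T), the action each of K candidate
    policies would take at each logged context.
    """
    match = (policy_actions == actions)          # (K, T) via broadcasting
    return (match * rewards / propensities).mean(axis=1)

def uncertainty(actions, propensities, policy_actions):
    """Illustrative penalty (an assumption, not the paper's quantifier):
    sqrt(sum_t 1{A_t = pi(X_t)} / e_t^2) / T, large when a policy relies
    on rarely explored actions."""
    match = (policy_actions == actions)
    T = actions.shape[0]
    return np.sqrt((match / propensities**2).sum(axis=1)) / T

def select_policies(actions, rewards, propensities, policy_actions, beta=1.0):
    """Return (greedy index, pessimistic index) over the candidates."""
    values = ipw_values(actions, rewards, propensities, policy_actions)
    lcb = values - beta * uncertainty(actions, propensities, policy_actions)
    return int(np.argmax(values)), int(np.argmax(lcb))

# Toy logged data (hypothetical): action 1 is explored with propensity 0.9,
# action 0 with propensity 0.1, and action 0 is seen once with a lucky reward.
actions = np.array([1] * 9 + [0])
rewards = np.array([0.5] * 9 + [1.0])
propensities = np.array([0.9] * 9 + [0.1])

# Two candidate policies: "always play 0" and "always play 1".
policy_actions = np.array([[0] * 10, [1] * 10])

greedy_idx, pessimistic_idx = select_policies(
    actions, rewards, propensities, policy_actions, beta=1.0
)
print(greedy_idx, pessimistic_idx)
```

In this toy data the point estimate favors the poorly explored "always play 0" policy because a single sample is inflated by its small propensity, whereas the penalized criterion prefers the well-covered policy. The penalty for a policy depends only on the propensities of the actions that policy itself selects, which mirrors the abstract's point that overlap is needed only for the optimal policy.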
