Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
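The two-stage construction can be sketched on synthetic logged-bandit data: score a candidate policy class with an inverse-propensity-weighted (IPW) value estimate, keep the policies within a tolerance of the best score (the near-optimal set), and abstain wherever they disagree. All names, the threshold policy class, and the tolerance value are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic logged data: contexts X, logged actions A,
# rewards R, and known logging propensities P (illustrative setup).
n, d, n_actions = 2000, 3, 2
X = rng.normal(size=(n, d))
A = rng.integers(0, n_actions, size=n)       # uniform logging policy
P = np.full(n, 1.0 / n_actions)              # known propensities
best = (X[:, 0] > 0).astype(int)             # unknown optimal action
R = (A == best).astype(float) + 0.1 * rng.normal(size=n)

def ipw_value(policy, X, A, R, P):
    """IPW estimate of a policy's value from logged data."""
    chosen = policy(X)
    return np.mean((chosen == A) * R / P)

# Stage 1: identify a near-optimal set within a tolerance of the best
# estimated value (tolerance chosen arbitrarily for this sketch).
candidates = [lambda X, j=j, t=t: (X[:, j] > t).astype(int)
              for j in range(d) for t in (-0.5, 0.0, 0.5)]
values = np.array([ipw_value(pi, X, A, R, P) for pi in candidates])
tol = 0.05
near_opt = [pi for pi, v in zip(candidates, values)
            if v >= values.max() - tol]

# Stage 2: abstain wherever the near-optimal policies disagree;
# otherwise follow their common recommendation.
def policy_with_abstention(x):
    acts = {int(pi(x[None, :])[0]) for pi in near_opt}
    return acts.pop() if len(acts) == 1 else "abstain"
```

In practice an abstention would hand the context to a safe default or an expert; here the string `"abstain"` simply marks that outcome. Replacing `ipw_value` with a doubly robust estimator would correspond to the paper's unknown-propensity extension.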