To design effective digital interventions, experimenters face the challenge of learning decision policies that balance multiple objectives using offline data. Often, they aim to develop policies that maximize goal outcomes while ensuring there are no undesirable changes in guardrail outcomes. To provide credible recommendations, experimenters must not only identify policies that satisfy the desired changes in goal and guardrail outcomes, but also offer probabilistic guarantees about the changes these policies induce. In practice, however, policy classes are often large, and digital experiments tend to produce datasets whose effect sizes are small relative to the noise. In this setting, standard approaches such as data splitting or multiple testing often lead to unstable policy selection and/or insufficient statistical power. In this paper, we introduce Safe Noisy Policy Learning (SNPL), a novel approach that leverages the concept of algorithmic stability to address these challenges. Our method enables policy learning while simultaneously providing high-confidence guarantees using the entire dataset, avoiding the need for data splitting. We present finite-sample and asymptotic versions of our algorithm that ensure the recommended policy satisfies high-probability guarantees for avoiding guardrail regressions and/or achieving goal outcome improvements. We test both variants of our approach empirically on a real-world application of personalizing SMS delivery. Our results on real-world data suggest that, in settings with large policy classes and low signal-to-noise ratios, SNPL offers dramatic improvements under both finite-sample and asymptotic safety guarantees: up to 300\% higher detection rates and 150\% larger policy gains at significantly smaller sample sizes.
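To make the form of these guarantees concrete, one way to state them is sketched below; the notation ($\hat{\pi}$, $\pi_0$, $V_g$, $V_r$, $\epsilon$, $\delta$) is ours and not taken from the abstract, so this should be read as an illustrative formalization rather than the paper's exact statement. Writing $\hat{\pi}$ for the recommended policy, $\pi_0$ for the status-quo policy, $V_g(\cdot)$ for the goal outcome, and $V_r(\cdot)$ for a guardrail outcome, a high-probability safety guarantee of the kind described above could take the form
\[
\mathbb{P}\Big( V_r(\hat{\pi}) \ge V_r(\pi_0) - \epsilon \ \text{ and } \ V_g(\hat{\pi}) \ge V_g(\pi_0) \Big) \ge 1 - \delta,
\]
where $\epsilon \ge 0$ is the tolerated guardrail regression and $\delta \in (0,1)$ is the allowed failure probability. Under this reading, the finite-sample variant would certify such a bound at any fixed sample size, while the asymptotic variant would certify it as the sample size grows.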