Evidence-based targeting is attracting growing interest among practitioners in policy and business. Formulating the decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner adaptively assigns treatment arms to sequentially arriving experimental units after observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule for the population. Taking the worst-case expected regret as the performance criterion for the adaptive sampling and the recommended policy, we derive its asymptotic lower bounds and propose a strategy, the Adaptive Sampling-Policy Learning strategy (PLAS), whose regret upper bound has a leading factor that matches the lower bound as the number of experimental units increases.
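
The two-stage protocol described above (adaptive sampling under a fixed budget, then recommendation of an individualized assignment rule) can be illustrated with a minimal Python sketch. This is not the PLAS strategy: the assignment rule below is plain uniform randomization, used only as a stand-in, and the recommendation simply picks the arm with the highest sample mean in each context. The function names, the two discrete contexts, the three arms, the Gaussian outcome model, and the uniform context distribution are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

def run_experiment(budget, contexts, arms, draw_outcome, rng):
    """Sampling stage: for each arriving unit, observe its context,
    assign an arm, and record the outcome (fixed budget of units)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(budget):
        x = rng.choice(contexts)   # unit arrives with covariate x (assumed uniform)
        a = rng.choice(arms)       # stand-in rule: uniform assignment, NOT PLAS
        y = draw_outcome(x, a, rng)
        sums[(x, a)] += y
        counts[(x, a)] += 1
    return sums, counts

def recommend_policy(contexts, arms, sums, counts):
    """Recommendation stage: map each context to the arm with the best sample mean."""
    def mean(x, a):
        n = counts[(x, a)]
        return sums[(x, a)] / n if n else float("-inf")
    return {x: max(arms, key=lambda a: mean(x, a)) for x in contexts}

def policy_regret(policy, contexts, arms, true_mean):
    """Average gap between the best arm and the recommended arm,
    taken over contexts with equal weight (assumed uniform distribution)."""
    gaps = [
        max(true_mean[(x, a)] for a in arms) - true_mean[(x, policy[x])]
        for x in contexts
    ]
    return sum(gaps) / len(gaps)

# Hypothetical environment: two contexts, three arms, well-separated means.
CONTEXTS = ["young", "old"]
ARMS = ["A", "B", "C"]
TRUE_MEAN = {
    ("young", "A"): 0.0, ("young", "B"): 1.0, ("young", "C"): 0.2,
    ("old", "A"): 1.0, ("old", "B"): 0.0, ("old", "C"): 0.2,
}

def draw_outcome(x, a, rng):
    # Gaussian outcome around the true conditional mean (assumed noise level 0.1).
    return rng.gauss(TRUE_MEAN[(x, a)], 0.1)

rng = random.Random(0)
sums, counts = run_experiment(600, CONTEXTS, ARMS, draw_outcome, rng)
policy = recommend_policy(CONTEXTS, ARMS, sums, counts)
regret = policy_regret(policy, CONTEXTS, ARMS, TRUE_MEAN)
```

An adaptive strategy would replace the uniform `rng.choice(arms)` step with an assignment probability that depends on the observed context and on the data collected so far. The regret computed here is for a single hypothetical environment, whereas the criterion in the abstract is the worst case over environments.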