Evidence-based targeting is attracting growing interest among practitioners in policy and business. Formulating the decision-maker's policy learning as a fixed-budget best arm identification (BAI) problem with contextual information, we study optimal adaptive experimental design for policy learning with multiple treatment arms. In the sampling stage, the planner adaptively assigns treatment arms to sequentially arriving experimental units after observing their contextual information (covariates). After the experiment, the planner recommends an individualized assignment rule to the population. Taking the worst-case expected regret as the performance criterion for the adaptive sampling and the recommended policy, we derive asymptotic lower bounds on it and propose a strategy, the Adaptive Sampling-Policy Learning strategy (PLAS), whose regret upper bound has a leading factor that matches the lower bound as the number of experimental units increases.
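To make the two-stage protocol concrete, the following is a minimal simulation sketch of a fixed-budget experiment with contextual information. It is not the PLAS strategy itself. It assumes a finite set of contexts drawn uniformly, Gaussian outcomes, an unrestricted policy class, and a simple allocation heuristic that targets sampling shares proportional to the estimated outcome standard deviations. The names `adaptive_experiment` and `policy_regret` and all parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def adaptive_experiment(mu, sigma, budget, seed=0):
    """Simulate a fixed-budget experiment with discrete contexts.

    mu, sigma: arrays of shape (n_contexts, n_arms) holding the true outcome
    means and standard deviations (unknown to the planner, used only to
    generate outcomes).
    Returns the recommended policy (one arm per context) and the pull counts.
    """
    rng = np.random.default_rng(seed)
    n_contexts, n_arms = mu.shape
    counts = np.zeros((n_contexts, n_arms), dtype=int)
    sums = np.zeros((n_contexts, n_arms))
    sq_sums = np.zeros((n_contexts, n_arms))

    # Sampling stage: one unit arrives per round and its context is observed.
    for _ in range(budget):
        x = rng.integers(n_contexts)  # assumption: contexts are uniform
        if counts[x].min() < 2:
            # Initialisation: pull every arm twice so variances can be estimated.
            a = int(np.argmin(counts[x]))
        else:
            means = sums[x] / counts[x]
            var = np.maximum(sq_sums[x] / counts[x] - means**2, 1e-12)
            std = np.sqrt(var)
            target = std / std.sum()  # assumed target shares, not from the paper
            share = counts[x] / counts[x].sum()
            a = int(np.argmax(target - share))  # most under-sampled arm
        y = rng.normal(mu[x, a], sigma[x, a])
        counts[x, a] += 1
        sums[x, a] += y
        sq_sums[x, a] += y * y

    # Recommendation stage: per-context empirical best arm.
    means = sums / np.maximum(counts, 1)
    policy = means.argmax(axis=1)
    return policy, counts


def policy_regret(mu, policy):
    """Regret of a policy relative to the best arm in each context,
    averaged over a uniform context distribution."""
    best = mu.max(axis=1)
    chosen = mu[np.arange(mu.shape[0]), policy]
    return float((best - chosen).mean())
```

Calling `adaptive_experiment(mu, sigma, budget=2000)` with a `(n_contexts, n_arms)` array of means and standard deviations returns a policy that can be scored with `policy_regret(mu, policy)`. Averaging that score over many seeds approximates the expected regret for one instance; the worst-case criterion in the abstract would additionally take the supremum over a class of outcome distributions, which this sketch does not attempt.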