As AI becomes more prevalent throughout society, effective methods of integrating humans and AI systems that leverage their respective strengths and mitigate risk have become an important priority. In this paper, we introduce the paradigm of super-policy learning, which takes advantage of human-AI interaction for data-driven sequential decision making. This approach uses the observed action, whether taken by an AI or a human, as an input to policy learning, giving the decision maker (a human or an AI) a stronger oracle. In a decision process with unmeasured confounding, the actions taken by past agents can offer valuable insights into undisclosed information. By including this information in the policy search in a novel and legitimate manner, the proposed super-policy learning yields a super-policy that is guaranteed to outperform both the standard optimal policy and the behavior policy (e.g., past agents' actions). We call this stronger oracle a blessing from human-AI interaction. Furthermore, to address unmeasured confounding when finding super-policies from batch data, we establish a number of nonparametric causal identification results under the framework of proximal causal inference. Building on these novel identification results, we develop several super-policy learning algorithms and systematically study their theoretical properties, such as finite-sample regret guarantees. Finally, we illustrate the effectiveness of our proposal through extensive simulations and real-world applications.
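The claim that a policy conditioning on the past agent's action can beat both the standard optimal policy and the behavior policy can be checked on a small toy problem. The sketch below is illustrative only and is not the paper's algorithm: it assumes a one-step decision with a fully known model, so it sidesteps the identification problem that the proximal causal inference results address. All names and numbers (`P_S`, `P_U`, `ACCURACY`, `reward`, `best_value`) are hypothetical choices made for this example. The past agent sees the unmeasured confounder U and matches it with an accuracy that depends on the observed state S; the reward is 1 when the chosen action equals U.

```python
from itertools import product

# Hypothetical toy model (all numbers are assumptions for illustration).
P_S = {0: 0.5, 1: 0.5}        # observed state S
P_U = {0: 0.3, 1: 0.7}        # unmeasured confounder U, independent of S
ACCURACY = {0: 0.9, 1: 0.6}   # P(behavior action == U | S = s)
ACTIONS = (0, 1)


def reward(u, a):
    """Reward is 1 when the chosen action matches the hidden confounder."""
    return 1.0 if a == u else 0.0


def joint():
    """Yield (s, u, a_b, probability) for every outcome of the toy model."""
    for s, u, a_b in product(P_S, P_U, ACTIONS):
        p_act = ACCURACY[s] if a_b == u else 1.0 - ACCURACY[s]
        yield s, u, a_b, P_S[s] * P_U[u] * p_act


def best_value(context_of):
    """Value of the best policy that maps context_of(s, a_b) to an action."""
    # Per context, accumulate the unnormalized expected reward of each action.
    totals = {}
    for s, u, a_b, p in joint():
        per_action = totals.setdefault(
            context_of(s, a_b), {a: 0.0 for a in ACTIONS}
        )
        for a in ACTIONS:
            per_action[a] += p * reward(u, a)
    # Choosing the best action in each context and summing gives the value.
    return sum(max(per_action.values()) for per_action in totals.values())


# Behavior policy: replay the past agent's own action.
behavior_value = sum(p * reward(u, a_b) for s, u, a_b, p in joint())
# Standard optimal policy: sees only the observed state S.
standard_value = best_value(lambda s, a_b: s)
# Super-policy: sees S and the past agent's action.
super_value = best_value(lambda s, a_b: (s, a_b))

print(f"behavior: {behavior_value:.3f}")
print(f"standard: {standard_value:.3f}")
print(f"super:    {super_value:.3f}")
```

Under these assumed numbers the values are 0.750 for the behavior policy, 0.700 for the standard optimal policy, and 0.800 for the super-policy. The super-policy follows the past agent when S = 0, where the agent is accurate, and falls back to the prior-optimal action when S = 1, where the agent is noisy, which is why it improves on both baselines here.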