Empirical researchers and decision-makers spanning various domains frequently seek profound insights into the long-term impacts of interventions. While the significance of long-term outcomes is undeniable, an overemphasis on them may inadvertently overshadow short-term gains. Motivated by this, this paper formalizes a new framework for learning the optimal policy that effectively balances both long-term and short-term rewards, where some long-term outcomes are allowed to be missing. In particular, we first present the identifiability of both rewards under mild assumptions. Next, we deduce the semiparametric efficiency bounds, along with the consistency and asymptotic normality of their estimators. We also reveal that short-term outcomes, if associated, contribute to improving the estimator of the long-term reward. Based on the proposed estimators, we develop a principled policy learning approach and further derive the convergence rates of regret and estimation errors associated with the learned policy. Extensive experiments are conducted to validate the effectiveness of the proposed method, demonstrating its practical applicability.
翻译:各领域实证研究人员与决策者常需深入探究干预措施的长期影响。尽管长期结果的重要性毋庸置疑,但过度关注长期结果可能无意中忽视短期收益。受此启发,本文正式提出一种学习最优策略的新框架,该框架能够有效平衡长期与短期奖励,并允许部分长期结果存在缺失。具体而言,我们首先在温和假设下证明了两种奖励的可识别性。接着推导了半参数效率界,以及其估计量的一致性和渐近正态性。我们进一步揭示,若短期结果具有相关性,可辅助改进长期奖励估计量。基于所提估计量,我们开发了具有理论依据的策略学习方法,并进一步推导了学习策略的遗憾收敛速率及估计误差。通过大量实验验证了所提方法的有效性,展示了其实用价值。