Randomized experiments (or A/B tests) are widely used to evaluate interventions in dynamic systems such as recommendation platforms, marketplaces, and digital health. In these settings, interventions affect both current and future system states, so estimating the global average treatment effect (GATE) requires accounting for temporal dynamics, which is especially challenging in the presence of nonstationarity; existing approaches suffer from high bias, high variance, or both. In this paper, we address this challenge via the novel Truncated Policy Gradient (TPG) estimator, which replaces instantaneous outcomes with short-horizon outcome trajectories. The estimator admits a policy-gradient interpretation: it is a truncation of the first-order approximation to the GATE, yielding provable reductions in bias and variance in nonstationary Markovian settings. We further establish a central limit theorem for the TPG estimator and develop a consistent variance estimator that remains valid under nonstationarity with single-trajectory data. We validate our theory with two real-world case studies. The results show that a well-calibrated TPG estimator attains low bias and variance in practical nonstationary settings.
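The abstract does not spell out the estimator's exact form, but the core idea (replace each instantaneous outcome with a truncated k-step outcome trajectory, then contrast treated and control periods) can be illustrated with a rough sketch. The snippet below is an assumption-laden toy, not the paper's construction: the function name tpg_estimate, the Bernoulli(p) switchback-style design, and the inverse-propensity weighting are all illustrative choices, and the simulated AR(1)-style nonstationary Markov chain is synthetic.

```python
import numpy as np

# Illustrative sketch only: the paper's exact TPG weighting is not given in
# the abstract. This is a generic inverse-propensity-weighted estimator that
# swaps instantaneous outcomes for truncated k-step outcome trajectories,
# assuming i.i.d. Bernoulli(p) treatment assignment over time.

def tpg_estimate(W, Y, p=0.5, k=5):
    """Truncated-trajectory GATE estimate from a single trajectory.

    W : (T,) array of 0/1 treatment assignments, Bernoulli(p).
    Y : (T,) array of observed outcomes.
    k : truncation horizon; k = 0 recovers the naive instantaneous
        difference-in-means (up to the IPW normalization).
    """
    T = len(Y)
    # k-step forward outcome sums: G[t] = Y[t] + ... + Y[t+k]
    G = np.array([Y[t:t + k + 1].sum() for t in range(T - k)])
    # Centered IPW score contrasting treated vs. control periods.
    score = W[:T - k] / p - (1 - W[:T - k]) / (1 - p)
    return (score * G).mean()

# Toy nonstationary Markov chain: treatment shifts the state upward,
# and a slow sinusoidal drift makes the dynamics nonstationary.
rng = np.random.default_rng(0)
T, p, effect = 20_000, 0.5, 0.3
W = rng.binomial(1, p, size=T)
S = np.zeros(T)
for t in range(1, T):
    drift = 0.1 * np.sin(2 * np.pi * t / 5_000)  # nonstationarity
    S[t] = 0.8 * S[t - 1] + effect * W[t] + drift + rng.normal(scale=0.5)
Y = S  # observe the state directly as the outcome

print(tpg_estimate(W, Y, p=p, k=10))
```

In this toy chain the steady-state GATE is effect / (1 - 0.8) = 1.5, and a k-step truncation captures roughly the first k + 1 terms of the geometric carryover, leaving a truncation bias of order 0.8^(k+1); increasing k trades that bias against the variance of the longer trajectory sums, which mirrors the bias-variance tradeoff the abstract attributes to the horizon choice.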