Adaptive experiments produce dependent data that break the i.i.d. assumptions underlying classical concentration bounds, invalidating standard learning guarantees. In this paper, we develop a self-normalized maximal inequality for martingale empirical processes. Building on this result, we first propose an adaptive sample-variance penalization procedure, valid for general dependent data, that balances empirical loss against sample variance. This in turn allows us to derive a new variance-regularized pessimistic off-policy learning objective, for which we establish excess-risk guarantees. We then show that, when combined with sequential updates and under standard complexity and margin conditions, the resulting estimator achieves fast convergence rates in both parametric and nonparametric regimes, improving over the usual $1/\sqrt{n}$ baseline. We complement our theoretical findings with numerical simulations that illustrate the practical gains of our approach.
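For concreteness, the sample-variance penalization step can be sketched in the following schematic form; the symbols $\hat{L}_n$, $\hat{\sigma}_n^2$, and the penalty weight $\lambda_n$ are illustrative notation, not fixed by the abstract:
\[
\hat{f} \in \arg\min_{f \in \mathcal{F}} \left\{ \hat{L}_n(f) + \lambda_n \sqrt{\frac{\hat{\sigma}_n^2(f)}{n}} \right\},
\]
where $\hat{L}_n(f)$ is the empirical average loss of $f$ over the $n$ adaptively collected observations and $\hat{\sigma}_n^2(f)$ is the corresponding sample variance; in this style of analysis, it is the self-normalized maximal inequality that licenses a penalty of this form under martingale dependence.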
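Similarly, the variance-regularized pessimistic off-policy objective can be read as a lower-confidence-bound policy choice. A minimal sketch, assuming an importance-weighted value estimate $\hat{V}_n(\pi)$ with sample variance $\hat{\sigma}_n^2(\pi)$ (again, illustrative notation):
\[
\hat{\pi} \in \arg\max_{\pi \in \Pi} \left\{ \hat{V}_n(\pi) - \lambda_n \sqrt{\frac{\hat{\sigma}_n^2(\pi)}{n}} \right\},
\]
so the learner is pessimistic in the sense that it maximizes the estimated value minus a data-driven variance penalty rather than the point estimate alone.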