In bandit algorithms, the adaptive experimental design varies randomly over time, which makes it difficult to apply traditional limit theorems to off-policy evaluation of the treatment effect. Moreover, the normal approximation given by the central limit theorem becomes unsatisfactory because the small sample size of the inferior arm provides too little information. To resolve this issue, we introduce a backward asymptotic expansion method and prove its validity based on partial mixing, which was originally introduced for the expansion of the distribution of a functional of a jump-diffusion process in a random environment. In this paper, the theory is generalized to incorporate the backward propagation of random functions in the bandit algorithm. Besides the analytical validation, simulation studies also support the new method. Our formulation is general and applicable to nonlinearly parametrized differentiable statistical models with an adaptive design.