Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and are a central topic in statistical decision theory. Yet valid sequential testing under adaptive allocation remains challenging, because asymptotic theory is lacking when reward sequences are non-i.i.d. and when the sample sizes of some arms grow only sublinearly. To address this open challenge, we propose an Urn Bandit (UNB) process that integrates the reinforcement mechanism of urn probabilistic models with MAB principles, ensuring that the allocation proportions concentrate almost surely on the optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of the expected rewards under non-i.i.d. reward sequences with non-sub-Gaussian tails and pairwise cross-arm dependence. Existing methods focus mainly on cumulative regret and therefore provide only algorithmic performance guarantees, without supporting valid sequential testing. To overcome this limitation, we develop an asymptotic theory for sequential test statistics under the proposed UNB process. The resulting framework enables a broad class of sequential inference procedures, such as A/B testing and policy evaluation. Simulation studies and real data analysis demonstrate that UNB maintains testing performance comparable to that of the equal randomization (ER) design while accumulating more reward than ER.
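To make the urn reinforcement idea concrete, the following is a minimal sketch of a generic randomly reinforced urn on Bernoulli arms, set beside equal randomization. It is not the UNB process itself, whose update rule is not specified here. The function names, the one-ball-per-success update, and the Bernoulli reward model are all assumptions made for illustration.

```python
import random


def urn_allocation(success_probs, n_rounds, seed=0):
    """Hypothetical urn-driven allocation (illustrative, not the UNB rule).

    The urn starts with one ball per arm. Each round draws an arm with
    probability proportional to its ball count, observes a Bernoulli
    reward, and adds one ball of that arm's colour on a success.
    """
    rng = random.Random(seed)
    n_arms = len(success_probs)
    balls = [1] * n_arms          # urn composition (assumed initial state)
    pulls = [0] * n_arms          # number of times each arm was played
    successes = [0] * n_arms      # accumulated reward per arm
    for _ in range(n_rounds):
        arm = rng.choices(range(n_arms), weights=balls)[0]
        reward = 1 if rng.random() < success_probs[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
        balls[arm] += reward      # reinforcement: successes attract future draws
    return pulls, successes


def equal_randomization(success_probs, n_rounds, seed=0):
    """Baseline ER design: every arm is drawn with equal probability."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    pulls = [0] * n_arms
    successes = [0] * n_arms
    for _ in range(n_rounds):
        arm = rng.randrange(n_arms)
        reward = 1 if rng.random() < success_probs[arm] else 0
        pulls[arm] += 1
        successes[arm] += reward
    return pulls, successes


def sample_means(pulls, successes):
    """Per-arm sample mean reward; None for an arm that was never played."""
    return [s / p if p > 0 else None for s, p in zip(successes, pulls)]


if __name__ == "__main__":
    probs = [0.3, 0.7]  # assumed true success probabilities
    for name, design in [("urn", urn_allocation), ("ER", equal_randomization)]:
        pulls, successes = design(probs, 5000, seed=1)
        print(name, pulls, sum(successes), sample_means(pulls, successes))
```

Under this toy rule, the inferior arm is still sampled but increasingly rarely, which is the situation the abstract points to: the per-arm sample sizes are random, unbalanced, and depend on past rewards, so inference on the sample means calls for asymptotic theory beyond the i.i.d. setting.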