Multi-armed bandit (MAB) processes constitute a foundational subclass of reinforcement learning problems and a central topic in statistical decision theory. Their use for simultaneous adaptive allocation and sequential testing, however, is limited by the absence of asymptotic theory under non-i.i.d. sequences and sublinear information. To address this open challenge, we propose the Urn Bandit (UNB) process, which integrates the reinforcement mechanism of urn probabilistic models with MAB principles and ensures almost sure convergence of resource allocation to the optimal arms. We establish a joint functional central limit theorem (FCLT) for consistent estimators of expected rewards under non-i.i.d., non-sub-Gaussian, and sublinear reward samples with pairwise correlations across arms. Whereas existing methods focus mainly on cumulative regret, our asymptotic theory accompanies adaptive allocation and supports powerful sequential tests, such as arm comparison, A/B testing, and policy evaluation. Simulation studies and real data analysis demonstrate that UNB maintains the statistical test performance of an equal randomization (ER) design while obtaining higher average rewards, as classical MAB processes do.
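To make the contrast between urn-style reinforcement and equal randomization concrete, the following is a minimal sketch and not the UNB process itself, whose update rule is not specified in this abstract. It assumes Bernoulli rewards and a generic Pólya-type urn in which a successful pull adds one ball to the pulled arm. The function name `simulate` and the parameters `reinforce` and `initial_balls` are hypothetical and chosen for illustration only.

```python
import random


def simulate(probs, n_steps, reinforce, seed=0, initial_balls=5):
    """Allocate n_steps pulls over Bernoulli arms with success rates `probs`.

    reinforce=True  -> generic urn scheme (an assumption, not the paper's UNB):
                       draw an arm with probability proportional to its ball
                       count and add one ball to that arm on a success.
    reinforce=False -> equal randomization (ER): the urn never changes, so
                       every arm keeps the same draw probability.
    Returns (allocation shares per arm, average reward per pull).
    """
    rng = random.Random(seed)
    k = len(probs)
    balls = [float(initial_balls)] * k  # urn composition, one entry per arm
    pulls = [0] * k
    total_reward = 0
    for _ in range(n_steps):
        # Draw an arm with probability proportional to its ball count.
        arm = rng.choices(range(k), weights=balls)[0]
        reward = 1 if rng.random() < probs[arm] else 0
        pulls[arm] += 1
        total_reward += reward
        if reinforce and reward:
            balls[arm] += 1.0  # reinforcement: success adds a ball
    shares = [count / n_steps for count in pulls]
    return shares, total_reward / n_steps


if __name__ == "__main__":
    for label, flag in (("urn", True), ("ER", False)):
        shares, avg = simulate([0.2, 0.8], 5000, reinforce=flag, seed=1)
        print(label, [round(s, 3) for s in shares], round(avg, 3))
```

Under these assumptions the urn scheme shifts allocation toward the arm with the higher success rate, while the ER run keeps allocation near one half per arm. The sketch does not cover the inferential side of the abstract (the FCLT and the sequential tests).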