We consider the upper confidence bound (UCB) strategy for Gaussian multi-armed bandits with a known control horizon size $N$ and construct its limiting description as a system of stochastic differential equations and ordinary differential equations. The rewards of the arms are assumed to have unknown expected values and known variances. To verify the validity of the obtained description, we performed a set of Monte Carlo simulations for the case of close reward distributions, in which the mean rewards differ by a magnitude of order $N^{-1/2}$, since this case yields the highest normalized regret. We also estimate the minimal control horizon size at which the normalized regret is not noticeably larger than its maximum possible value.
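The kind of Monte Carlo experiment described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-armed setup, the index form (sample mean plus `a * sqrt(D * ln(n) / n_l)`), the symmetric parametrization of the means as $\pm d\sqrt{D/N}$, and the normalization of the regret by $\sqrt{DN}$ are all assumptions made for illustration, as are the names `ucb_normalized_regret`, `monte_carlo`, `gap_scale`, and `a`.

```python
import math
import random


def ucb_normalized_regret(horizon, gap_scale, variance=1.0, a=1.0, seed=0):
    """Run one UCB trajectory on a two-armed Gaussian bandit.

    Returns the pseudo-regret divided by sqrt(variance * horizon).
    All parameter names and the index form are illustrative assumptions.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(variance)
    # Assumed parametrization: means are +/- gap_scale * sqrt(variance / horizon),
    # so they differ by a magnitude of order horizon ** -0.5.
    delta = gap_scale * math.sqrt(variance / horizon)
    means = [delta, -delta]
    pulls = [0, 0]
    totals = [0.0, 0.0]

    for n in range(1, horizon + 1):
        if n <= 2:
            arm = n - 1  # pull each arm once to initialize the estimates
        else:
            # Assumed index: sample mean plus an exploration bonus.
            arm = max(
                range(2),
                key=lambda l: totals[l] / pulls[l]
                + a * math.sqrt(variance * math.log(n) / pulls[l]),
            )
        totals[arm] += rng.gauss(means[arm], sigma)
        pulls[arm] += 1

    # Expected loss relative to always pulling the better arm.
    regret = sum((max(means) - means[l]) * pulls[l] for l in range(2))
    return regret / math.sqrt(variance * horizon)


def monte_carlo(horizon, gap_scale, runs=200):
    """Average the normalized regret over independent seeded runs."""
    return sum(
        ucb_normalized_regret(horizon, gap_scale, seed=s) for s in range(runs)
    ) / runs
```

Under these assumptions, calling `monte_carlo(horizon, gap_scale)` over a grid of `gap_scale` values for several horizons gives a rough picture of where the normalized regret peaks and how that peak changes with the horizon size.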