This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem in which the goal is to find the optimal policy, that is, the policy that minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computing the optimal policy is often intractable, especially when the problem horizon or the number of arms is large. In this paper, we first show that, under a suitable rescaling, the Bayesian bandit problem converges to a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be obtained explicitly for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not grow with the horizon.
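To make the intractability concrete, the following minimal sketch computes the exact Bayes-optimal value by dynamic programming over posterior states. It is an illustration only and not the method proposed in this paper: it assumes two arms with Bernoulli rewards and independent Beta priors, and the name `bayes_optimal_value` is hypothetical. It returns the optimal expected total reward together with the number of posterior states visited, which grows polynomially (degree four) in the horizon even in this smallest case.

```python
from functools import lru_cache


def bayes_optimal_value(horizon, prior=(1, 1)):
    """Exact Bayes-optimal expected total reward for a two-armed Bernoulli
    bandit with independent Beta(a0, b0) priors (an assumed toy setting).

    Returns (value, number_of_posterior_states_visited).
    """
    a0, b0 = prior

    @lru_cache(maxsize=None)
    def value(s1, f1, s2, f2):
        # State: success/failure counts of each arm observed so far.
        if s1 + f1 + s2 + f2 == horizon:
            return 0.0
        # Posterior mean reward of each arm under the Beta prior.
        p1 = (a0 + s1) / (a0 + b0 + s1 + f1)
        p2 = (a0 + s2) / (a0 + b0 + s2 + f2)
        # Bellman recursion: expected reward now plus optimal value afterwards.
        q1 = p1 * (1 + value(s1 + 1, f1, s2, f2)) + (1 - p1) * value(s1, f1 + 1, s2, f2)
        q2 = p2 * (1 + value(s1, f1, s2 + 1, f2)) + (1 - p2) * value(s1, f1, s2, f2 + 1)
        return max(q1, q2)

    v = value(0, 0, 0, 0)
    return v, value.cache_info().currsize


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        v, states = bayes_optimal_value(n)
        print(f"horizon={n:3d}  value={v:.4f}  states={states}")
```

The recursion depth equals the horizon, so this sketch is only suitable for small horizons. With more arms the state space gains two dimensions per arm, which is why the exact approach becomes impractical.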