Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. However, there is a large gap between the strong practical performance of these approaches and their theoretical justification. Prior work offers only a negative theoretical result: Thompson sampling can incur worst-case linear regret $\Omega(T)$ even when the inference error, measured by a single $\alpha$-divergence, is bounded by a constant threshold. To bridge this gap, we propose an Enhanced Bayesian Upper Confidence Bound (EBUCB) framework that can efficiently accommodate bandit problems in the presence of approximate inference. Our theoretical analysis shows that, for Bernoulli multi-armed bandits, EBUCB achieves the optimal regret order $O(\log T)$ if the inference error measured by two different $\alpha$-divergences is bounded by a constant, regardless of how large this constant is. To the best of our knowledge, our study provides the first theoretical regret bound that is better than $o(T)$ in the setting of constant approximate inference error. Furthermore, in concordance with the negative results of previous studies, we show that a single bounded $\alpha$-divergence alone is insufficient to guarantee sublinear regret.
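For readers unfamiliar with the Bayesian upper confidence bound idea that EBUCB builds on, the following is a minimal sketch of the standard Bayes-UCB rule for Bernoulli arms, not of EBUCB itself; the abstract does not specify how EBUCB modifies the index. It assumes independent Beta(1, 1) priors and the quantile level $1 - 1/(t (\log T)^c)$. The function names `mc_beta_quantile` and `bayes_ucb`, the parameter `n_samples`, and the use of a Monte Carlo quantile estimate (a crude stand-in for an inexact posterior) are illustrative assumptions, not taken from the paper.

```python
import math
import random


def mc_beta_quantile(a, b, level, rng, n_samples=500):
    """Monte Carlo estimate of the `level`-quantile of Beta(a, b) (hypothetical helper)."""
    samples = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    # Clamp the index; levels above 1 - 1/n_samples saturate at the largest sample.
    idx = min(n_samples - 1, int(level * n_samples))
    return samples[idx]


def bayes_ucb(true_means, horizon, c=0, seed=0):
    """Run standard Bayes-UCB on a Bernoulli bandit; return (pull counts, pseudo-regret)."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    best_mean = max(true_means)
    regret = 0.0
    # (log T)^c factor in the quantile level; the max() guards against log T < 1.
    log_term = max(math.log(horizon), 1.0) ** c

    for t in range(1, horizon + 1):
        level = 1.0 - 1.0 / (t * log_term)
        # Index of each arm: an upper quantile of its Beta posterior.
        indices = [
            mc_beta_quantile(1 + successes[i], 1 + failures[i], level, rng)
            for i in range(k)
        ]
        arm = max(range(k), key=indices.__getitem__)

        # Simulated Bernoulli reward from the (unknown to the learner) true mean.
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        regret += best_mean - true_means[arm]

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = bayes_ucb([0.2, 0.8], horizon=300)
    print("pulls per arm:", pulls, "pseudo-regret:", regret)
```

The sketch tracks pseudo-regret, that is, the summed gap between the best arm's mean and the mean of each pulled arm, which is the quantity the $O(\log T)$ and $\Omega(T)$ statements above refer to.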