Thompson sampling (TS) offers a solution to the exploration-exploitation dilemma in Bayesian optimization (BO). It prioritizes exploration by randomly generating and maximizing sample paths of Gaussian process (GP) posteriors, but it manages exploitation only weakly, relying on the information about the true objective function gathered after each exploration step. In this study, we incorporate the epsilon-greedy ($\varepsilon$-greedy) policy, a well-established selection strategy in reinforcement learning, into TS to improve its exploitation. We first delineate two extremes of TS applied to BO, namely the generic TS and a sample-average TS, which promote exploration and exploitation, respectively. We then use the $\varepsilon$-greedy policy to randomly switch between the two extremes: a small value of $\varepsilon \in (0,1)$ prioritizes exploitation, and a large value prioritizes exploration. We empirically show that $\varepsilon$-greedy TS with an appropriate $\varepsilon$ outperforms one of its two extremes and is competitive with the other.
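The random switch between the two extremes can be illustrated with a minimal Python sketch. It is not the implementation used in the study and rests on several assumptions that are ours: a zero-mean GP with an RBF kernel, a discrete candidate grid on $[0,1]$, a toy objective, and a sample-average TS realized as the mean of `n_avg` posterior sample paths. All function names (`rbf_kernel`, `gp_posterior`, `sample_paths`, `eps_greedy_ts_step`, `run_demo`) are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    # Squared-exponential kernel on 1-D inputs (an assumed choice of kernel).
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_cand, noise=1e-4):
    # Posterior mean and covariance of a zero-mean GP on the candidate grid.
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_cand)
    K_ss = rbf_kernel(x_cand, x_cand)
    sol = np.linalg.solve(K, np.column_stack([y_train, K_s]))
    mean = K_s.T @ sol[:, 0]
    cov = K_ss - K_s.T @ sol[:, 1:]
    return mean, cov

def sample_paths(mean, cov, n_paths, rng):
    # Draw sample paths on the grid; eigenvalues are clipped at zero
    # so that round-off cannot make the covariance factor invalid.
    w, V = np.linalg.eigh(0.5 * (cov + cov.T))
    L = V * np.sqrt(np.clip(w, 0.0, None))
    z = rng.standard_normal((n_paths, len(mean)))
    return mean + z @ L.T

def eps_greedy_ts_step(x_train, y_train, x_cand, eps, rng, n_avg=50):
    mean, cov = gp_posterior(x_train, y_train, x_cand)
    if rng.random() < eps:
        # Generic TS (exploration): maximize a single sample path.
        path = sample_paths(mean, cov, 1, rng)[0]
    else:
        # Sample-average TS (exploitation): maximize the average of many paths.
        path = sample_paths(mean, cov, n_avg, rng).mean(axis=0)
    return x_cand[np.argmax(path)]

def toy_objective(x):
    # Hypothetical objective to maximize, used only for this illustration.
    return -(x - 0.3) ** 2

def run_demo(eps=0.2, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    x_cand = np.linspace(0.0, 1.0, 101)
    x_train = np.array([0.1, 0.9])
    y_train = toy_objective(x_train)
    for _ in range(n_iter):
        x_next = eps_greedy_ts_step(x_train, y_train, x_cand, eps, rng)
        x_train = np.append(x_train, x_next)
        y_train = np.append(y_train, toy_objective(x_next))
    return x_train, y_train

xs, ys = run_demo(eps=0.2)
print("best observed x:", xs[np.argmax(ys)])
```

In this sketch, `eps` is the probability of taking a generic TS step, so a small `eps` makes most iterations maximize the averaged sample path, and the limits `eps = 1` and `eps = 0` would recover the two extremes.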