We study a class of Multi-Armed Bandit (MAB) problems in which the arms are clustered and yield Gaussian reward feedback. Owing to the ubiquity of the Gaussian distribution, this setting arises in many real-world applications, for example mmWave communications and portfolio management with risky assets. Building on Thompson Sampling with Gaussian prior (TSG) for selecting the optimal arm, we propose Thompson Sampling with Clustered arms under Gaussian prior (TSCG), an algorithm tailored to the 2-level hierarchical structure. We prove that exploiting the 2-level structure yields a lower regret bound than ordinary TSG. In addition, when the reward is unimodal, our Unimodal Thompson Sampling algorithm with Clustered Arms under Gaussian prior (UTSCG) attains an even lower regret bound. Each proposed algorithm is accompanied by a theoretical upper bound on its regret, and our numerical experiments confirm the advantage of the proposed algorithms.
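To make the 2-level idea concrete, the following is a minimal Python sketch of one plausible reading of Thompson Sampling over clustered arms with Gaussian sampling. It is not the TSCG algorithm as specified in the paper: the function name `two_level_gaussian_ts`, the use of pooled cluster statistics at the first level, and the sampling distribution N(mean estimate, 1/(n+1)) at both levels are assumptions made for illustration only.

```python
import numpy as np


def two_level_gaussian_ts(means, clusters, horizon, sigma=1.0, seed=0):
    """Illustrative 2-level Thompson Sampling with Gaussian sampling.

    means    : true mean reward of each arm (hypothetical test environment)
    clusters : list of lists, each holding the arm indices of one cluster
    horizon  : number of rounds to play
    Returns the cumulative pseudo-regret and the pull count of each arm.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    n_arms, n_clusters = len(means), len(clusters)

    arm_sum = np.zeros(n_arms)      # sum of rewards observed per arm
    arm_cnt = np.zeros(n_arms)      # number of pulls per arm
    clu_sum = np.zeros(n_clusters)  # sum of rewards observed per cluster
    clu_cnt = np.zeros(n_clusters)  # number of pulls per cluster

    best_mean = means.max()
    regret = 0.0

    for _ in range(horizon):
        # Level 1 (assumption): sample one index per cluster from
        # N(sum / (n + 1), 1 / (n + 1)) and pick the largest sample.
        clu_theta = rng.normal(clu_sum / (clu_cnt + 1),
                               1.0 / np.sqrt(clu_cnt + 1))
        c = int(np.argmax(clu_theta))

        # Level 2: the same sampling rule restricted to the chosen cluster.
        arms = np.asarray(clusters[c])
        arm_theta = rng.normal(arm_sum[arms] / (arm_cnt[arms] + 1),
                               1.0 / np.sqrt(arm_cnt[arms] + 1))
        a = int(arms[np.argmax(arm_theta)])

        # Observe a Gaussian reward and update both levels.
        reward = rng.normal(means[a], sigma)
        arm_sum[a] += reward
        arm_cnt[a] += 1
        clu_sum[c] += reward
        clu_cnt[c] += 1

        regret += best_mean - means[a]

    return regret, arm_cnt
```

A call such as `two_level_gaussian_ts([0.1, 0.2, 0.8, 0.9], [[0, 1], [2, 3]], horizon=2000)` returns the cumulative pseudo-regret and the per-arm pull counts. The intuition behind the hierarchy is that the first level can discard a whole cluster without exploring each of its arms individually, which is where a saving over flat TSG would be expected to come from.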