Efficient exploration is a key challenge in contextual bandits with large action spaces, where uninformed exploration can be both computationally and statistically inefficient. Fortunately, the rewards of actions are often correlated, and this structure can be leveraged to explore efficiently. In this work, we capture such correlations using pre-trained diffusion models, upon which we design diffusion Thompson sampling (dTS). We develop both theoretical and algorithmic foundations for dTS, and empirical evaluation shows its favorable performance.
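To make the core idea concrete, here is a minimal, generic sketch of Thompson sampling with a *joint* Gaussian prior over arm rewards. A correlated prior covariance lets one arm's observed reward update the posterior of every other arm, which is the kind of structure the abstract proposes to capture with a pre-trained diffusion model. This is an illustrative assumption-laden toy, not the paper's dTS: the prior here is hand-specified rather than learned by a diffusion model, and all names (`gaussian_ts`, etc.) are hypothetical.

```python
import numpy as np

def gaussian_ts(prior_mean, prior_cov, true_means, noise_sd, n_rounds, seed=0):
    """Thompson sampling over K arms with a joint Gaussian prior.

    Generic illustration only: in dTS the correlation structure would come
    from a pre-trained diffusion model; here it is simply hand-specified.
    """
    rng = np.random.default_rng(seed)
    mu = np.array(prior_mean, dtype=float)
    Sigma = np.array(prior_cov, dtype=float)
    counts = np.zeros(len(mu), dtype=int)
    for _ in range(n_rounds):
        sample = rng.multivariate_normal(mu, Sigma)  # draw from posterior
        a = int(np.argmax(sample))                   # act greedily on the sample
        r = rng.normal(true_means[a], noise_sd)      # observe a noisy reward
        # Conjugate Gaussian update: conditioning the joint on arm a's
        # reward shifts the posterior mean of every correlated arm.
        gain = Sigma[:, a] / (Sigma[a, a] + noise_sd ** 2)
        mu = mu + gain * (r - mu[a])
        Sigma = Sigma - np.outer(gain, Sigma[a])
        counts[a] += 1
    return counts

# Three positively correlated arms; the best arm should dominate pulls.
cov = 0.5 * np.ones((3, 3)) + 0.5 * np.eye(3)
pulls = gaussian_ts([0.0, 0.0, 0.0], cov, [0.1, 0.5, 0.9], 1.0, 500)
```

The point of the correlated covariance is that pulling one arm reduces posterior uncertainty about the others, so exploration cost does not scale linearly with the number of arms.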