We consider a Multi-Armed Bandit problem in which the rewards are non-stationary and depend on past actions and, potentially, on past contexts. At the heart of our method is a recurrent neural network that models these sequences. To balance exploration and exploitation, we introduce an energy minimization term that prevents the network from becoming overconfident in any single action. This term provably limits the gap between the maximal and minimal probabilities assigned by the network. In a diverse set of experiments, we demonstrate that our method is at least as effective as methods proposed for the sub-problem of Rotting Bandits, and that it can solve intuitive extensions of various benchmark problems. We share our implementation at https://github.com/rotmanmi/Energy-Regularized-RNN.
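To make the role of the energy term concrete, the sketch below shows one way such a penalty could bound the probability gap. It assumes, purely for illustration, that the energy is the squared L2 norm of the network's output logits and that action probabilities come from a softmax; the abstract does not state the exact form, so `energy`, `gap_bound`, `regularized_loss`, and the weight `lam` are hypothetical names rather than the released implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def energy(logits):
    """Assumed energy term: squared L2 norm of the logits (an illustration only)."""
    return sum(z * z for z in logits)


def gap_bound(energy_value):
    """Upper bound on p_max - p_min whenever energy(logits) <= energy_value.

    (z_max - z_min)^2 <= 2 * (z_max^2 + z_min^2) <= 2 * energy, hence
    p_max / p_min = exp(z_max - z_min) <= exp(sqrt(2 * energy)) and
    p_max - p_min <= 1 - exp(-sqrt(2 * energy)).
    """
    return 1.0 - math.exp(-math.sqrt(2.0 * energy_value))


def regularized_loss(task_loss, logits, lam):
    """Task loss plus a weighted energy penalty; lam is a hypothetical hyperparameter."""
    return task_loss + lam * energy(logits)


# Example: three arms with arbitrary logits.
logits = [2.0, -1.0, 0.5]
probs = softmax(logits)
gap = max(probs) - min(probs)
bound = gap_bound(energy(logits))
print(f"observed gap = {gap:.3f}, bound = {bound:.3f}")
```

Under this assumption, keeping the energy small keeps every action's probability away from zero, so the policy continues to explore; a larger `lam` tightens the bound at the cost of less decisive exploitation.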