In nonstationary bandit learning problems, the decision-maker must continually gather information and adapt their action selection as the latent state of the environment evolves. In each time period, some latent optimal action maximizes expected reward under the current environment state. We view the sequence of optimal actions as a stochastic process and take an information-theoretic approach to analyzing attainable performance. We bound the limiting per-period regret in terms of the entropy rate of the optimal action process. The bound applies to a wide array of problems studied in the literature and reflects each problem's information structure through its information ratio.
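To make the notion of an entropy rate concrete, the sketch below computes it for one hypothetical optimal action process. The abstract does not specify a model, so the switching model here is purely an assumption for illustration: the optimal action is taken to follow a stationary Markov chain over k arms that keeps its current value with probability 1 - p and otherwise jumps uniformly to another arm. For a stationary Markov chain with transition matrix P and stationary distribution pi, the entropy rate is the standard quantity -sum_i pi_i sum_j P_ij log P_ij. The function names are illustrative and not taken from the paper.

```python
import math


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate the stationary distribution of P by power iteration.

    Assumes the chain is well behaved enough for the iteration to settle
    (e.g. irreducible and aperiodic); this is not checked.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def entropy_rate(P):
    """Entropy rate, in nats per period, of a stationary Markov chain."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0.0  # terms with p == 0 contribute nothing
    )


def switching_chain(k, p):
    """Hypothetical model: stay with prob. 1 - p, else jump uniformly to another arm."""
    return [
        [1.0 - p if i == j else p / (k - 1) for j in range(k)]
        for i in range(k)
    ]


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        rate = entropy_rate(switching_chain(4, p))
        print(f"switch prob {p}: entropy rate {rate:.4f} nats/period")
```

Under this assumed model, a rarely changing optimal action has an entropy rate near zero, while frequent and unpredictable changes push the rate up. That is the quantity the abstract's regret bound is stated in terms of.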