Consider a decision-maker that can pick one of $K$ actions in each of $T$ turns to control an unknown system. The actions are interpreted as different configurations or policies. If the same action is held fixed, the system asymptotically converges to a unique equilibrium that depends on this action. The dynamics of the system are unknown to the decision-maker, which only observes a noisy reward at the end of every turn. The decision-maker wants to maximize its accumulated reward over the $T$ turns. Learning which equilibria are better leads to higher rewards, but waiting for the system to converge to equilibrium costs valuable time. Existing bandit algorithms, whether stochastic or adversarial, achieve linear (trivial) regret for this problem. We present a novel algorithm, termed Upper Equilibrium Concentration Bound (UECB), that switches away from an action quickly when waiting for the system to reach equilibrium is not worthwhile. It does so by using convergence bounds to determine how far the system is from equilibrium. We prove that UECB achieves a regret of $\mathcal{O}(\log(T)+\tau_c\log(\tau_c)+\tau_c\log\log(T))$ for this equilibrium bandit problem, where $\tau_c$ is the worst-case approximate convergence time to equilibrium. We then show that both epidemic control and game control are special cases of equilibrium bandits, where $\tau_c\log \tau_c$ typically dominates the regret. Finally, we test UECB numerically for both applications.
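To make the setting concrete, the following is a minimal toy sketch of an equilibrium bandit environment. It is not UECB and not the model analyzed in the paper. The scalar state, the geometric contraction rate, the Gaussian noise, and all names (`ToyEquilibriumSystem`, `convergence_bound`, `average_reward`) are assumptions made purely for illustration. It shows why switching actions every turn is costly: the system never gets close to any equilibrium, so the rewards say little about which equilibrium is better.

```python
import random


class ToyEquilibriumSystem:
    """Hypothetical scalar system: under a fixed action the state contracts
    geometrically toward that action's equilibrium (an assumed dynamic)."""

    def __init__(self, equilibria, rate=0.2, noise_std=0.1, seed=0):
        self.equilibria = equilibria  # equilibrium reward of each action (assumed)
        self.rate = rate              # contraction rate in (0, 1] (assumed)
        self.noise_std = noise_std    # standard deviation of the reward noise
        self.state = 0.0
        self.rng = random.Random(seed)

    def step(self, action):
        """Play one turn: move toward the action's equilibrium, return a noisy reward."""
        target = self.equilibria[action]
        self.state += self.rate * (target - self.state)
        return self.state + self.rng.gauss(0.0, self.noise_std)


def convergence_bound(initial_gap, rate, turns_held):
    """Bound on the distance to equilibrium after holding one action for
    `turns_held` turns; it holds with equality in this toy model."""
    return abs(initial_gap) * (1.0 - rate) ** turns_held


def average_reward(system, schedule):
    """Average noisy reward obtained by playing the actions in `schedule` in order."""
    rewards = [system.step(a) for a in schedule]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    equilibria = [1.0, 0.0]  # action 0 has the better equilibrium
    T = 200
    hold = average_reward(ToyEquilibriumSystem(equilibria), [0] * T)
    alternate = average_reward(
        ToyEquilibriumSystem(equilibria), [t % 2 for t in range(T)]
    )
    print(f"hold action 0: {hold:.3f}, alternate every turn: {alternate:.3f}")
```

In this sketch, `convergence_bound` plays the role that convergence bounds play in the abstract: it tells the decision-maker how far the observed reward can still be from the equilibrium reward, so an action can be abandoned before the system has fully converged.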