We revisit the finite-armed linear bandit model of Nelson et al. (2022), in which contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model through a reduction to linear contextual bandits; to do so, however, they introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (though not their algorithm) also ignores the estimation of the hidden Markov model (HMM) parameters and provides only expected, not high-probability, bounds, which moreover suffer from unnecessarily complex dependencies on the model (such as reward gaps). We instead study the more natural model in which rewards depend directly on the hidden states (in addition to the observed contexts, as is natural for contextual bandits), and we obtain stronger, high-probability regret bounds for a fully adaptive strategy that estimates the HMM parameters online. These bounds do not depend on the reward functions and depend on the model only through the estimation of the HMM parameters.