Sequential decision-making domains such as recommender systems, healthcare, and education often have unobserved heterogeneity in the population that can be modeled using latent bandits, a framework in which an unobserved latent state determines the model for a trajectory. While the latent bandit framework is compelling, the extent of its generality is unclear. We first address this by establishing a de Finetti theorem for decision processes, and show that $\textit{every}$ exchangeable and coherent stateless decision process is a latent bandit. The latent bandit framework lends itself particularly well to online learning with offline datasets, a problem of growing interest in sequential decision-making. One can leverage offline latent bandit data to learn a complex model for each latent state, so that an agent need only learn the latent state online to act optimally. We focus on a linear model for a latent bandit with $d_A$-dimensional actions, where the latent states lie in an unknown $d_K$-dimensional subspace for $d_K \ll d_A$. We present SOLD, a novel, principled method with guarantees for learning this subspace from short offline trajectories. We then provide two methods to leverage this subspace online: LOCAL-UCB and ProBALL-UCB. We demonstrate that LOCAL-UCB enjoys a regret guarantee of $\tilde O(\min(d_A\sqrt{T}, d_K\sqrt{T}(1+\sqrt{d_A T/(d_K N)})))$, where the effective dimension is lower when the size $N$ of the offline dataset is larger. ProBALL-UCB enjoys a slightly weaker guarantee, but is more practical and computationally efficient. Finally, we establish the efficacy of our methods using experiments on both synthetic data and real-life movie recommendation data from MovieLens.
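To make the dependence of the LOCAL-UCB bound on the offline dataset size concrete, the following is a minimal numeric sketch. It evaluates only the leading-order expression inside $\tilde O(\cdot)$, dropping constants and logarithmic factors, and it assumes the fraction under the square root is read as $d_A T/(d_K N)$. The function names `regret_scale` and `effective_dimension`, and the example values of $d_A$, $d_K$, $T$, and $N$, are illustrative choices and not taken from the paper.

```python
import math


def regret_scale(d_A, d_K, T, N):
    """Leading-order term of the stated LOCAL-UCB bound.

    Constants and log factors hidden by the tilde-O are ignored,
    so only relative comparisons between settings are meaningful.
    """
    # Rate without using the subspace: d_A * sqrt(T)
    full_dim = d_A * math.sqrt(T)
    # Rate using the learned subspace, inflated by the offline estimation error
    subspace = d_K * math.sqrt(T) * (1 + math.sqrt(d_A * T / (d_K * N)))
    return min(full_dim, subspace)


def effective_dimension(d_A, d_K, T, N):
    """Factor multiplying sqrt(T) in the leading-order term."""
    return regret_scale(d_A, d_K, T, N) / math.sqrt(T)


# Illustrative values only: d_A = 100, d_K = 5, T = 10,000 online steps.
for N in (10**2, 10**4, 10**6, 10**8):
    print(N, round(effective_dimension(100, 5, 10_000, N), 2))
```

Under these assumptions, the effective dimension equals $d_A$ when $N$ is small (the minimum selects the first term) and approaches $d_K$ as $N$ grows, which matches the qualitative claim in the abstract.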