Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.