A New Bandit Setting Balancing Information from State Evolution and Corrupted Context

We propose a new sequential decision-making setting that combines key aspects of two established online learning problems with bandit feedback. The optimal action to play at any given moment is contingent on an underlying changing state that is not directly observable by the agent. Each state is associated with a context distribution, possibly corrupted, allowing the agent to identify the state. Furthermore, states evolve in a Markovian fashion, so the state history provides useful information for estimating the current state. In the proposed problem setting, we tackle the challenge of deciding on which of the two sources of information the agent should base its arm selection. We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit. We capture the time-correlation of states by iteratively learning the action-reward transition model, allowing for efficient exploration of actions. Our setting is motivated by adaptive mobile health (mHealth) interventions. Users transition through different, time-correlated, but only partially observable internal states that determine their current needs. The side information associated with each internal state might not always be reliable, so standard approaches that rely solely on the context risk incurring high regret. Similarly, some users might exhibit weaker correlations between subsequent states, so approaches that rely solely on state transitions run the same risk. We analyze our setting and algorithm in terms of regret lower and upper bounds, and we evaluate our method on simulated medication adherence intervention data and several real-world data sets, showing improved empirical performance compared to several popular algorithms.
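To make the idea of a referee arbitrating between a context-based and a history-based policy concrete, the following is a minimal illustrative sketch and not the algorithm proposed in the paper. Everything in it is an assumption made for illustration: the two-state toy environment, the `MeanTablePolicy` and `Referee` classes, the epsilon-greedy base learners, the EXP3-style exponential-weights referee, and the parameters `stay`, `corruption`, `eta`, and `gamma`. The history-based policy conditions on the previous (action, reward) pair, which is only a loose stand-in for an action-reward transition model.

```python
import math
import random


class MeanTablePolicy:
    """Epsilon-greedy learner with one table of empirical mean rewards per key.

    Hypothetical helper: the key is the observed context for the
    context-based policy and the previous (arm, reward) pair for the
    history-based policy.
    """

    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.stats = {}  # key -> list of [reward_sum, pull_count] per arm

    def _row(self, key):
        return self.stats.setdefault(key, [[0.0, 0] for _ in range(self.n_arms)])

    def select(self, key, rng):
        row = self._row(key)
        untried = any(count == 0 for _, count in row)
        if untried or rng.random() < self.epsilon:
            return rng.randrange(self.n_arms)  # explore
        return max(range(self.n_arms), key=lambda a: row[a][0] / row[a][1])

    def update(self, key, arm, reward):
        cell = self._row(key)[arm]
        cell[0] += reward
        cell[1] += 1


class Referee:
    """EXP3-style exponential weights over base policies (an assumed choice)."""

    def __init__(self, n_policies, eta=0.1, gamma=0.05):
        self.log_w = [0.0] * n_policies
        self.eta = eta      # learning rate
        self.gamma = gamma  # uniform mixing keeps every probability positive

    def probs(self):
        top = max(self.log_w)
        w = [math.exp(x - top) for x in self.log_w]
        total = sum(w)
        k = len(w)
        return [(1 - self.gamma) * x / total + self.gamma / k for x in w]

    def update(self, chosen, prob, reward):
        # Importance-weighted gain for the policy that was actually followed.
        self.log_w[chosen] += self.eta * reward / prob


def simulate(horizon=5000, stay=0.95, corruption=0.9, seed=0):
    """Toy environment: two hidden states, the best arm equals the state."""
    rng = random.Random(seed)
    context_policy = MeanTablePolicy(2)
    history_policy = MeanTablePolicy(2)
    referee = Referee(2)
    state, history_key, total = 0, None, 0.0
    for _ in range(horizon):
        if rng.random() > stay:  # Markovian state evolution
            state = 1 - state
        # With probability `corruption` the context is replaced by noise.
        context = rng.randrange(2) if rng.random() < corruption else state
        p = referee.probs()
        chosen = rng.choices([0, 1], weights=p)[0]
        if chosen == 0:
            arm = context_policy.select(context, rng)
        else:
            arm = history_policy.select(history_key, rng)
        reward = 1.0 if arm == state else 0.0
        # Both base learners see every observation; only the referee's
        # weight for the followed policy is updated.
        context_policy.update(context, arm, reward)
        history_policy.update(history_key, arm, reward)
        referee.update(chosen, p[chosen], reward)
        history_key = (arm, reward)
        total += reward
    return total / horizon, referee.probs()


if __name__ == "__main__":
    avg_reward, final_probs = simulate()
    print("average reward:", avg_reward)
    print("referee probabilities [context, history]:", final_probs)
```

In this toy setup, heavily corrupted contexts combined with slowly changing states should push the referee toward the history-based policy, while setting `corruption=0.0` and `stay=0.5` should favor the context-based policy. This mirrors the trade-off described above, but it says nothing about the regret guarantees analyzed in the paper.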
