A key challenge in deploying reinforcement learning in practice is avoiding excessive (harmful) exploration in individual episodes. We propose a natural constraint on exploration: \textit{uniformly} outperforming a conservative policy (adaptively estimated from all data observed thus far), up to a per-episode exploration budget. We design a novel algorithm that uses a UCB reinforcement learning policy for exploration, but overrides it as needed to satisfy our exploration constraint with high probability. Importantly, to ensure unbiased exploration across the state space, our algorithm adaptively determines when to explore. We prove that our approach remains conservative while minimizing regret in the tabular setting. We experimentally validate our results on a sepsis treatment task and an HIV treatment task, demonstrating that our algorithm can learn while ensuring good performance relative to the baseline policy for every patient; the HIV task also demonstrates that our approach extends to continuous state spaces via deep reinforcement learning.
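To make the exploration constraint concrete, the following is a minimal sketch of a per-episode override rule. It is not the algorithm from the paper: the function name `choose_policy`, its parameters, and the episode-level (rather than within-episode) decision are all illustrative assumptions. It only shows the general idea of following the exploratory policy when a pessimistic estimate of its value stays within the budget of the conservative policy's estimated value.

```python
def choose_policy(explore_value_lcb: float,
                  baseline_value: float,
                  budget: float) -> str:
    """Pick which policy to run for the next episode (hypothetical helper).

    explore_value_lcb: high-probability lower bound on the expected return
        of the exploratory (e.g. UCB-style) policy. How this bound is
        computed is left open here and is an assumption of the sketch.
    baseline_value: estimated expected return of the conservative policy.
    budget: per-episode exploration budget (allowed shortfall, >= 0).
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    # Explore only if, even pessimistically, the exploratory policy loses
    # at most `budget` relative to the conservative baseline.
    if explore_value_lcb >= baseline_value - budget:
        return "explore"
    # Otherwise override exploration and fall back to the baseline.
    return "baseline"
```

For instance, with a baseline value of 1.0 and a budget of 0.3, a lower bound of 0.8 permits exploration, while a lower bound of 0.6 triggers the fallback to the conservative policy.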