We study reinforcement learning (RL) problems in which agents observe the reward or transition realizations at their current state before deciding which action to take. Such observations arise in many applications, including transactions and navigation. When the environment is known, previous work shows that this lookahead information can drastically increase the collected reward. However, outside of specific applications, existing approaches for interacting with unknown environments are not well-adapted to these observations. In this work, we close this gap and design provably efficient learning algorithms that incorporate lookahead information. To achieve this, we perform planning using the empirical distribution of the reward and transition observations, in contrast to vanilla approaches that only rely on estimated expectations. We prove that our algorithms achieve tight regret against a baseline that also has access to lookahead information, which translates to a linear increase in collected reward compared to agents that cannot handle lookahead information.
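To make the planning idea concrete, the following is a minimal sketch (not the paper's algorithm) of a single decision step with reward lookahead: before committing to an action at state s, the agent sees the realized rewards for all actions at s and combines them with an empirical transition model estimated from past data, whereas a vanilla agent would substitute an estimated mean reward. All names (lookahead_greedy_action, r_obs, P_hat) are illustrative assumptions.

```python
import numpy as np

def lookahead_greedy_action(s, r_obs, P_hat, V, gamma=0.99):
    """Pick the action maximizing observed reward plus estimated continuation value.

    s      : current state index
    r_obs  : array of shape (A,) with the reward realizations revealed at state s
    P_hat  : empirical transition probabilities estimated from data, shape (S, A, S)
    V      : current value estimate for the next step, shape (S,)
    """
    # Lookahead-aware action values: realized reward + discounted expected continuation.
    q = r_obs + gamma * P_hat[s] @ V
    return int(np.argmax(q))

# A vanilla (expectation-based) agent would replace r_obs with an estimated mean
# reward vector r_hat[s], ignoring the realization it could have observed before acting.
```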