Many reinforcement learning (RL) applications have combinatorial action spaces, where each action is a composition of sub-actions. A standard RL approach ignores this inherent factorization structure, resulting in a potential failure to make meaningful inferences about rarely observed sub-action combinations; this is particularly problematic for offline settings, where data may be limited. In this work, we propose a form of linear Q-function decomposition induced by factored action spaces. We study the theoretical properties of our approach, identifying scenarios where it is guaranteed to lead to zero bias when used to approximate the Q-function. Outside the regimes with theoretical guarantees, we show that our approach can still be useful because it leads to better sample efficiency without necessarily sacrificing policy optimality, allowing us to achieve a better bias-variance trade-off. Across several offline RL problems using simulators and real-world datasets motivated by healthcare, we demonstrate that incorporating factored action spaces into value-based RL can result in better-performing policies. Our approach can help an agent make more accurate inferences within underexplored regions of the state-action space when applying RL to observational datasets.
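As a minimal sketch of what a linear decomposition over a factored action space could look like, assume the Q-value of a joint action is the sum of one component per sub-action, Q(s, a) = Σ_d q_d(s, a_d). The abstract does not spell out the exact parameterization, so this additive form, the names `q_value` and `greedy_action`, and the toy numbers are illustrative assumptions, not the authors' implementation. Under this assumed form, the greedy joint action can be found by maximizing each sub-action independently.

```python
from itertools import product

def q_value(sub_q, action):
    """Assumed additive form: Q(s, a) = sum over d of q_d(s, a_d), for one fixed state s.

    sub_q[d][a_d] holds the hypothetical component q_d(s, a_d).
    """
    return sum(q[a] for q, a in zip(sub_q, action))

def greedy_action(sub_q):
    """Pick the best choice in each sub-action dimension independently."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in sub_q)

# Toy example (made-up numbers): two sub-action dimensions with 2 and 3 choices.
sub_q = [[0.1, 0.7], [0.3, 0.2, 0.9]]

# Brute-force search over all joint actions, for comparison with the factored argmax.
joint_actions = list(product(*(range(len(q)) for q in sub_q)))
brute_force = max(joint_actions, key=lambda a: q_value(sub_q, a))

# Values to estimate per state: one per joint action vs. one per sub-action choice.
n_joint = len(joint_actions)
n_factored = sum(len(q) for q in sub_q)
```

In this toy case the factored form needs 5 values per state rather than 6; the gap widens quickly as sub-action dimensions are added, which is one way to read the sample-efficiency claim. When the true Q-function is not additive, this form introduces bias, which is the trade-off the abstract refers to.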