Reinforcement learning (RL) has been shown to learn sophisticated control policies for complex tasks including games, robotics, heating and cooling systems, and text generation. The action-perception cycle in RL, however, generally assumes that a measurement of the state of the environment is available at each time step at no cost. In applications such as materials design, deep-sea and planetary robot exploration, and medicine, there can be a high cost associated with measuring, or even approximating, the state of the environment. In this paper, we survey the growing literature that adopts the perspective that an RL agent might not need, or even want, a costly measurement at each time step. Within this context, we propose the Deep Dynamic Multi-Step Observationless Agent (DMSOA), contrast it with the literature, and empirically evaluate it on OpenAI Gym and Atari Pong environments. Our results show that DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature. The corresponding code is available at: \url{https://github.com/cbellinger27/Learning-when-to-observe-in-RL}