Reinforcement learning (RL) has been shown to learn sophisticated control policies for complex tasks including games, robotics, heating and cooling systems, and text generation. The action-perception cycle in RL, however, generally assumes that a measurement of the state of the environment is available at each time step at no cost. In applications such as deep-sea and planetary robot exploration, materials design, and medicine, measuring, or even approximating, the state of the environment can be costly. In this paper, we survey the growing literature that adopts the perspective that an RL agent might not need, or even want, a costly measurement at each time step. Within this context, we propose the Deep Dynamic Multi-Step Observationless Agent (DMSOA), contrast it with the literature, and empirically evaluate it on OpenAI Gym and Atari Pong environments. Our results show that DMSOA learns a better policy with fewer decision steps and measurements than the considered alternative from the literature.