In classic reinforcement learning algorithms, agents make decisions at discrete, fixed time intervals. The duration between decisions becomes a crucial hyperparameter: setting it too short may increase the problem's difficulty by requiring the agent to make numerous decisions to achieve its goal, while setting it too long can result in the agent losing control over the system. However, physical systems do not necessarily require a constant control frequency, and for learning agents it is often preferable to operate at a low frequency when possible and a high frequency when necessary. We propose a framework called Continuous-Time Continuous-Options (CTCO), in which the agent chooses options as sub-policies of variable duration. These options are time-continuous and can interact with the system at any desired frequency, providing smooth changes in action. We demonstrate the effectiveness of CTCO by comparing its performance to classical RL and temporal-abstraction RL methods on simulated continuous control tasks with various action-cycle times. We show that our algorithm's performance is not affected by the choice of environment interaction frequency. Furthermore, we demonstrate the efficacy of CTCO in facilitating exploration in a real-world visual reaching task with sparse rewards for a 7-DOF robotic arm.
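The idea of a time-continuous option that can be queried at any interaction frequency can be illustrated with a small sketch. This is not the CTCO implementation: the `Option` class, its scalar `waypoints`, the linear interpolation, and the `rollout` helper are all hypothetical choices made only for illustration. The sketch assumes an option is a positive duration plus a few action waypoints, and that the action at any time is interpolated between them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Option:
    """Hypothetical time-continuous option (illustrative, not the paper's parameterization)."""
    duration: float          # seconds the option stays active; assumed > 0
    waypoints: List[float]   # scalar actions spaced evenly over the duration

    def action(self, t: float) -> float:
        """Action at time t since the option started, by linear interpolation."""
        if len(self.waypoints) == 1:
            return self.waypoints[0]
        # Clamp to [0, 1] so queries past the end hold the final waypoint.
        phase = min(max(t / self.duration, 0.0), 1.0)
        pos = phase * (len(self.waypoints) - 1)
        i = min(int(pos), len(self.waypoints) - 2)
        frac = pos - i
        return (1.0 - frac) * self.waypoints[i] + frac * self.waypoints[i + 1]


def rollout(option: Option, dt: float) -> List[float]:
    """Sample the option every dt seconds; dt plays the role of the action-cycle time."""
    steps = int(round(option.duration / dt))
    return [option.action(k * dt) for k in range(steps + 1)]


opt = Option(duration=1.0, waypoints=[0.0, 1.0, 0.5])
coarse = rollout(opt, dt=0.5)    # low interaction frequency
fine = rollout(opt, dt=0.25)     # high interaction frequency, same option
```

In this sketch, `coarse` and `fine` are samples of the same underlying action curve, so every second entry of `fine` coincides with an entry of `coarse`. Changing the action-cycle time changes only how densely the curve is sampled, not the behavior the option encodes.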