Continuous-time reinforcement learning tasks commonly use discrete steps of fixed cycle times for actions. Because practitioners must choose the action-cycle time for a given task, a significant concern is whether the hyper-parameters of the learning algorithm need to be re-tuned for each choice of cycle time, which is prohibitive for real-world robotics. In this work, we investigate the widely used baseline hyper-parameter values of two policy gradient algorithms -- PPO and SAC -- across different cycle times. Using a benchmark task on which the baseline hyper-parameters of both algorithms were shown to work well, we reveal that when a cycle time different from the task default is chosen, PPO with baseline hyper-parameters fails to learn. Moreover, both PPO and SAC with their baseline hyper-parameters perform substantially worse than with hyper-parameters tuned for each cycle time. We propose novel approaches for setting these hyper-parameters based on the cycle time. In our experiments on simulated and real-world robotic tasks, the proposed approaches performed at least as well as the baseline hyper-parameters, with significantly better performance for most choices of the cycle time, and did not result in learning failure for any cycle time. Hyper-parameter tuning remains a significant barrier for real-world robotics, as our approaches require some initial tuning on a new task, even though this is negligible compared to extensive tuning for each cycle time. Our approaches require no additional tuning after the cycle time is changed for a given task and are a step toward avoiding extensive and costly hyper-parameter tuning for real-world policy optimization.