Deep reinforcement learning (RL) has shown immense potential for learning to control systems from data alone. However, one challenge deep RL faces is that the full state of the system is often not observable. When this is the case, the policy needs to leverage the history of observations to infer the current state. At the same time, differences between the training and testing environments make it critical for the policy not to overfit to the sequence of observations it sees at training time. As such, there is an important balancing act: the history encoder must be flexible enough to extract relevant information, yet robust to changes in the environment. To strike this balance, we look to the PID controller for inspiration. We assert that the PID controller's success shows that only summing and differencing are needed to accumulate information over time for many control tasks. Following this principle, we propose two architectures for encoding history: one that directly uses PID features and another that extends these core ideas and can be used in arbitrary control tasks. When compared with prior approaches, our encoders produce policies that are often more robust and achieve better performance on a variety of tracking tasks. Going beyond tracking tasks, our policies achieve 1.7x better performance on average over previous state-of-the-art methods on a suite of high-dimensional control tasks.
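To make the "summing and differencing" idea concrete, the sketch below computes the three classical PID terms from a history of scalar observations. It is only an illustration under stated assumptions, not the encoder proposed in this work: the function name `pid_features`, the scalar `target`, and the time step `dt` are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def pid_features(
    observations: Sequence[float],
    target: float,
    dt: float = 1.0,
) -> List[Tuple[float, float, float]]:
    """Return (proportional, integral, derivative) features per time step.

    Hypothetical helper for illustration only; it assumes a scalar
    observation and a fixed scalar tracking target.
    """
    features = []
    integral = 0.0      # running sum of errors (the "summing" part)
    prev_error = None   # previous error (for the "differencing" part)
    for obs in observations:
        error = target - obs
        integral += error * dt
        # No previous error exists at the first step, so use 0.0 there.
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt
        features.append((error, integral, derivative))
        prev_error = error
    return features


history = [0.0, 0.5, 0.75]
feats = pid_features(history, target=1.0)
```

In a setup like this, the final tuple in `feats` could be passed to a policy in place of the raw observation history, so the only temporal operations involved are a running sum and a one-step difference.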