Deep reinforcement learning (DRL) is increasingly applied in large-scale production systems at companies such as Netflix and Facebook. As with most data-driven systems, DRL systems can exhibit undesirable behaviors due to environmental drifts, which often occur in constantly changing production settings. Continual learning (CL) is the inherent self-healing approach for adapting the DRL agent in response to shifts in the environment's conditions. However, successive shifts of considerable magnitude may cause the production environment to drift from its original state. Recent studies have shown that these environmental drifts tend to drive CL into long, or even unsuccessful, healing cycles, which arise from inefficiencies such as catastrophic forgetting, warm-starting failure, and slow convergence. In this paper, we propose Dr. DRL, an effective self-healing approach for DRL systems that integrates a novel mechanism of intentional forgetting into vanilla CL to overcome its main issues. Dr. DRL deliberately erases the DRL system's minor behaviors to systematically prioritize the adaptation of its key problem-solving skills. Using well-established DRL algorithms, we compare Dr. DRL with vanilla CL on various drifted environments. On average, Dr. DRL reduces the healing time and the number of fine-tuning episodes by 18.74% and 17.72%, respectively. Dr. DRL also helps agents adapt to 19.63% of the drifted environments left unsolved by vanilla CL, while maintaining, and even enhancing by up to 45%, the rewards obtained on drifted environments that are resolved by both approaches.