Real-world reinforcement learning often faces environment drift, yet most existing methods rely on a static entropy coefficient or target entropy. This causes over-exploration during stable periods and under-exploration after drift (and thus slow recovery), and it leaves open the principled question of how exploration intensity should scale with drift magnitude. We prove that entropy scheduling under non-stationarity reduces to a one-dimensional, round-by-round trade-off: tracking the optimum quickly after drift versus avoiding gratuitous randomness while the environment is stable. Exploration strength can therefore be driven by measurable online drift signals. Building on this, we propose AES (Adaptive Entropy Scheduling), which adjusts the entropy coefficient/temperature online from observable drift proxies during training, requires almost no structural changes, and incurs minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces drift-induced performance degradation and accelerates recovery after abrupt changes.
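The adaptive update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the specific update rule, the choice of drift proxy, and all hyperparameter names (`gain`, `decay`, the clipping range) are hypothetical and not the paper's actual specification of AES.

```python
# Hypothetical sketch of an AES-style update: raise the entropy coefficient
# when a measured drift signal is large, and let it decay toward a floor
# when the environment looks stable. Rule and constants are assumptions.

def update_entropy_coef(alpha, drift_proxy,
                        alpha_min=1e-3, alpha_max=0.5,
                        gain=0.1, decay=0.99):
    """One round of the schedule.

    alpha       -- current entropy coefficient / temperature
    drift_proxy -- nonnegative online drift signal, e.g. a change in a
                   moving average of TD error (an assumed choice of proxy)
    """
    # Exploration strength scales with the observed drift magnitude,
    # while geometric decay removes gratuitous randomness in stable periods.
    alpha = decay * alpha + gain * drift_proxy
    # Keep the coefficient inside a safe operating range.
    return min(max(alpha, alpha_min), alpha_max)


# Stable round: no drift observed, coefficient decays slightly.
stable = update_entropy_coef(0.1, 0.0)

# Abrupt drift: a large proxy value pushes exploration up to the cap.
after_drift = update_entropy_coef(0.1, 10.0)
```

In a stable round the coefficient shrinks geometrically (0.1 → 0.099 here), while a large drift proxy drives it to the upper clip, matching the trade-off the abstract describes.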