Recent research has turned to Reinforcement Learning (RL) as an alternative to hand-tuned heuristics for solving challenging decision problems. RL can learn good policies without modeling the environment's dynamics. Despite this promise, RL remains impractical for many real-world systems problems. A particularly challenging case arises when the environment changes over time, i.e., it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity, shed light on the range of approaches for handling them, and develop a robust framework that addresses them when training RL agents in live systems. Such agents must explore and learn new environments without hurting the system's performance, and must remember those environments over time. To this end, our framework (i) identifies the different environments encountered by the live system, (ii) triggers exploration when necessary, (iii) takes precautions to retain knowledge from prior environments, and (iv) employs safeguards to protect the system's performance when the RL agent makes mistakes. We apply our framework to two systems problems, straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that all components of the framework are necessary to cope with non-stationarity, and we provide guidance on alternative design choices for each component.
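The four components listed in the abstract can be pictured as one control loop. The sketch below is only an illustration of how such pieces might fit together, not the authors' implementation: the class `NonStationaryController`, the scalar workload feature, the nearest-centroid environment detector, the round-robin exploration budget, the per-environment reward tables, and the threshold-based safeguard are all hypothetical stand-ins chosen for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class EnvPolicy:
    """Hypothetical per-environment state: reward estimates and an exploration budget."""
    explore_steps_left: int
    q: Dict[int, float] = field(default_factory=dict)  # action -> running reward estimate


class NonStationaryController:
    """Toy control loop combining the four components (all names are assumptions)."""

    def __init__(self, n_actions: int, fallback_action: int, unsafe_threshold: float,
                 explore_steps: int = 3, distance_threshold: float = 1.0) -> None:
        self.n_actions = n_actions
        self.fallback_action = fallback_action
        self.unsafe_threshold = unsafe_threshold
        self.explore_steps = explore_steps
        self.distance_threshold = distance_threshold
        self.centroids: List[float] = []      # one scalar feature per known environment
        self.policies: List[EnvPolicy] = []   # one policy per known environment

    def identify(self, feature: float) -> int:
        """(i) Match the observation to the nearest known environment, else register a new one."""
        if self.centroids:
            i = min(range(len(self.centroids)), key=lambda k: abs(feature - self.centroids[k]))
            if abs(feature - self.centroids[i]) <= self.distance_threshold:
                return i
        self.centroids.append(feature)
        # (ii) A newly detected environment starts with a fresh exploration budget.
        self.policies.append(EnvPolicy(explore_steps_left=self.explore_steps))
        return len(self.centroids) - 1

    def act(self, feature: float, system_metric: float) -> Tuple[int, int]:
        env_id = self.identify(feature)
        # (iv) Safeguard: if the monitored metric looks unsafe, defer to a default action.
        if system_metric > self.unsafe_threshold:
            return env_id, self.fallback_action
        policy = self.policies[env_id]
        if policy.explore_steps_left > 0:
            # (ii) Deterministic round-robin exploration while the budget lasts.
            action = (self.explore_steps - policy.explore_steps_left) % self.n_actions
            policy.explore_steps_left -= 1
            return env_id, action
        # Otherwise exploit the best action known for this environment.
        return env_id, max(range(self.n_actions), key=lambda a: policy.q.get(a, 0.0))

    def update(self, env_id: int, action: int, reward: float, lr: float = 0.5) -> None:
        """(iii) Update only this environment's estimates, leaving the others untouched."""
        q = self.policies[env_id].q
        old = q.get(action, 0.0)
        q[action] = old + lr * (reward - old)
```

Keeping a separate table per detected environment is the simplest way to avoid overwriting earlier knowledge; a real system would likely use a learned policy and a more careful change detector in place of these stand-ins.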