We study the robustness of deep reinforcement learning algorithms against distribution shifts within contextual multi-stage stochastic combinatorial optimization problems from the operations research domain. In this context, risk-sensitive algorithms promise to learn robust policies. While this field is of general interest to the reinforcement learning community, most studies to date focus on theoretical results rather than real-world performance. With this work, we aim to bridge this gap by formally deriving a novel risk-sensitive deep reinforcement learning algorithm and providing numerical evidence for its efficacy. Specifically, we introduce discrete Soft Actor-Critic for the entropic risk measure by deriving a version of the Bellman equation for the respective Q-values. We establish a corresponding policy improvement result and derive a practical algorithm from it. We introduce an environment that represents typical contextual multi-stage stochastic combinatorial optimization problems and perform numerical experiments to empirically validate our algorithm's robustness against realistic distribution shifts, without compromising performance on the training distribution. We show that our algorithm is superior to risk-neutral Soft Actor-Critic as well as to two benchmark approaches for robust deep reinforcement learning. We thereby provide the first structured analysis of the robustness of reinforcement learning under distribution shifts in the realm of contextual multi-stage stochastic combinatorial optimization problems.
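As an illustration of the entropic risk measure named in the abstract, the following minimal sketch evaluates it on a finite sample of returns. It assumes the common convention $\frac{1}{\beta}\log \mathbb{E}[\exp(\beta R)]$, under which $\beta < 0$ is risk-averse with respect to returns. The abstract does not state the paper's sign convention or notation, so the function name `entropic_risk` and the parameter `beta` are hypothetical, and this is not the authors' implementation.

```python
import math


def entropic_risk(returns, beta):
    """Entropic risk of equally weighted sample returns.

    Assumed convention: (1 / beta) * log(mean(exp(beta * r))).
    beta < 0 penalizes low returns (risk-averse), beta > 0 is
    risk-seeking, and beta -> 0 recovers the plain mean.
    """
    if not returns:
        raise ValueError("returns must be non-empty")
    if abs(beta) < 1e-12:
        # Risk-neutral limit: the ordinary expectation.
        return sum(returns) / len(returns)
    scaled = [beta * r for r in returns]
    # Log-mean-exp with the maximum subtracted for numerical stability.
    m = max(scaled)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in scaled) / len(scaled))
    return log_mean_exp / beta


if __name__ == "__main__":
    sample = [0.0, 10.0]
    print(entropic_risk(sample, -1.0))  # below the mean of 5 (risk-averse)
    print(entropic_risk(sample, 0.0))   # equals the mean
    print(entropic_risk(sample, 1.0))   # above the mean (risk-seeking)
```

Under this convention, a risk-averse agent values the uncertain sample below its mean, which is the mechanism by which a risk-sensitive objective can favor policies that are less exposed to unfavorable outcomes.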