Over the last decade, Reinforcement Learning (RL) has achieved remarkable success in the control and decision-making of complex dynamical systems. However, most RL algorithms assume the environment is a Markov Decision Process (MDP), an assumption that is violated in practical cyber-physical systems affected by sensing delays, actuation latencies, and communication constraints. Such time delays introduce memory effects that can significantly degrade performance and compromise stability, particularly in networked and multi-agent environments. This paper presents a comprehensive survey of RL methods designed to address time delays in control systems. We first formalize the main classes of delays and analyze their impact on the Markov property. We then systematically categorize existing approaches into five major families: state augmentation and history-based representations, recurrent policies with learned memory, predictor-based and model-aware methods, robust and domain-randomized training strategies, and safe RL frameworks with explicit constraint handling. For each family, we discuss underlying principles, practical advantages, and inherent limitations. A comparative analysis highlights key trade-offs among these approaches and provides practical guidelines for selecting suitable methods under different delay characteristics and safety requirements. Finally, we identify open challenges and promising research directions, including stability certification, large-delay learning, multi-agent communication co-design, and standardized benchmarking. This survey aims to serve as a unified reference for researchers and practitioners developing reliable RL-based controllers in delay-affected cyber-physical systems.