Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (insufficient data). Treating the two as a single risk leads to value underestimation in noisy states, which in turn causes catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework that explicitly disentangles these uncertainties. To hedge against aleatoric risk, RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network, which provides a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation show that RE-SAC achieves the highest cumulative reward (approximately -0.4e6, versus -0.55e6 for vanilla SAC). A Mahalanobis-distance rareness analysis confirms that RE-SAC reduces oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
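To make the dual mechanism concrete, the following is a minimal sketch and not the RE-SAC implementation. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the function name `pessimistic_q`, the coefficients `beta` and `lam`, the use of the ensemble standard deviation as the epistemic penalty, and the use of a mean L1 weight norm as a stand-in for the IPM-based regularizer. The sketch only shows how the two penalties can be kept as separate terms rather than folded into a single variance estimate.

```python
from statistics import mean, pstdev


def pessimistic_q(q_ensemble, critic_weights, beta=1.0, lam=0.01):
    """Hypothetical robust value estimate for one state-action pair.

    q_ensemble     -- Q estimates from each ensemble member (list of floats)
    critic_weights -- flattened weights of each ensemble member (list of lists)
    beta, lam      -- assumed penalty coefficients, not taken from the paper
    """
    q_mean = mean(q_ensemble)

    # Epistemic term (assumed form): disagreement between ensemble members,
    # which tends to grow in sparsely covered regions of the state space.
    epistemic_penalty = beta * pstdev(q_ensemble)

    # Aleatoric term (assumed form): a weight-norm penalty that does not
    # depend on ensemble disagreement, so noise is hedged separately.
    weight_norms = [sum(abs(w) for w in member) for member in critic_weights]
    aleatoric_penalty = lam * mean(weight_norms)

    return q_mean - epistemic_penalty - aleatoric_penalty


if __name__ == "__main__":
    q_values = [-100.0, -110.0, -90.0]
    weights = [[1.0, -2.0], [0.5, 0.5], [3.0, 0.0]]
    print(pessimistic_q(q_values, weights))
```

In this sketch, an ensemble whose members agree incurs no epistemic penalty, while the weight-norm term still lowers the estimate. That separation is the point of the example: a noisy but well-covered state is not penalized as if it were a data gap.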