RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Bus holding control is challenging because traffic conditions and passenger demand are stochastic. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (insufficient data). Treating them as a single risk leads to value underestimation in noisy states, which can cause catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework that explicitly disentangles these uncertainties. To hedge against aleatoric risk, RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network, which provides a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation show that RE-SAC achieves the highest cumulative reward (approximately -0.4e6, versus -0.55e6 for vanilla SAC). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
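
The dual mechanism described above can be illustrated with a minimal NumPy sketch. It is not the RE-SAC implementation: the abstract does not give the exact form of either penalty, so the sketch assumes a Frobenius-norm weight penalty as a stand-in for the IPM-based regularizer and the ensemble standard deviation as the epistemic penalty. The function names and the coefficients `beta` and `eps` are hypothetical.

```python
import numpy as np

def weight_norm_penalty(layer_weights):
    # Assumed stand-in for the IPM-based regularizer: the sum of the
    # Frobenius norms of the critic's weight matrices.
    return float(sum(np.linalg.norm(w) for w in layer_weights))

def robust_target(reward, next_q_ensemble, weight_penalty,
                  gamma=0.99, beta=1.0, eps=0.01):
    # next_q_ensemble has shape (n_critics, batch).
    q = np.asarray(next_q_ensemble, dtype=float)
    mean_q = q.mean(axis=0)
    # Epistemic term: disagreement across ensemble members (assumed form).
    epistemic = q.std(axis=0)
    # Aleatoric term: a state-independent penalty that depends only on
    # the critic weights, so no inner-loop perturbation is needed.
    aleatoric = eps * weight_penalty
    return reward + gamma * (mean_q - beta * epistemic - aleatoric)

# Toy data: the three critics agree on state 0 and disagree on state 1.
q_next = np.array([[10.0, 5.0],
                   [10.0, 7.0],
                   [10.0, 9.0]])
weights = [np.eye(2), np.ones((2, 1))]  # hypothetical critic layers
penalty = weight_norm_penalty(weights)
targets = robust_target(np.array([1.0, 1.0]), q_next, penalty)
print(penalty, targets)
```

In this toy case the ensemble penalty applies only to the state where the critics disagree, while the weight-norm penalty lowers both targets by the same amount. That separation is the point of the sketch: the noise hedge does not depend on ensemble variance.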
