Parallel server systems in transportation, manufacturing, and computing rely heavily on dynamic routing, which in turn depends on connected cyber components for computation and communication. Yet these components remain vulnerable to random malfunctions and malicious attacks, motivating the need for fault-tolerant dynamic routing schemes that are both traffic-stabilizing and cost-efficient. In this paper, we consider a parallel server system with dynamic routing subject to reliability and security failures. For the reliability setting, we consider an infinite-horizon Markov decision process in which the system operator strategically activates a protection mechanism upon each job arrival based on traffic state observations. We prove, via the dynamic programming recursion of the HJB equation, that an optimal deterministic threshold protection policy exists. For the security setting, we extend the model to an infinite-horizon stochastic game in which the attacker strategically manipulates routing assignments. We show that both players follow a threshold strategy in every Markov perfect equilibrium. For both failure settings, we also analyze the stability of the traffic queues under control. Finally, we develop approximate dynamic programming algorithms to compute the optimal/equilibrium policies, supplemented with numerical examples and experiments for validation and illustration.