Parallel server systems in transportation, manufacturing, and computing rely heavily on dynamic routing, which uses connected cyber components for computation and communication. Yet these components remain vulnerable to random malfunctions and malicious attacks, motivating the need for fault-tolerant dynamic routing that is both traffic-stabilizing and cost-efficient. In this paper, we consider a parallel server system with dynamic routing subject to reliability and security failures. For the reliability setting, we consider an infinite-horizon Markov decision process in which the system operator strategically activates a protection mechanism upon each job arrival based on traffic state observations. We prove that an optimal deterministic threshold protection policy exists, based on the dynamic programming recursion of the Hamilton-Jacobi-Bellman (HJB) equation. For the security setting, we extend the model to an infinite-horizon stochastic game in which the attacker strategically manipulates the routing assignment. We show that both players follow a threshold strategy at every Markov perfect equilibrium. For both failure settings, we also analyze the stability of the traffic queues under control. Finally, we develop approximate dynamic programming algorithms to compute the optimal and equilibrium policies, supplemented with numerical examples and experiments for validation and illustration.