Railway Turnout Machines (RTMs) are mission-critical components of railway transportation infrastructure, responsible for directing trains onto the desired tracks. In safety assurance applications, especially early-warning scenarios, RTM faults need to be detected as early as possible and around the clock (24/7). However, little attention has been paid to distributed model inference frameworks that can meet the inference latency and reliability requirements of such mission-critical fault diagnosis systems. In this paper, an edge-cloud collaborative early-warning system is proposed to enable real-time and downtime-tolerant fault diagnosis of RTMs, providing a new paradigm for deploying models in safety-critical scenarios. First, a modular fault diagnosis model is designed specifically for distributed deployment; it uses a hierarchical architecture consisting of a prior knowledge module, subordinate classifiers, and a fusion layer to improve accuracy and parallelism. Then, a cloud-edge collaborative framework leveraging pipeline parallelism, namely CEC-PA, is developed to minimize the overhead of distributed task execution and context exchange by strategically partitioning and offloading model components across cloud and edge. Additionally, an election consensus mechanism is implemented within CEC-PA to keep the system robust when the coordinator node is down. Comparative experiments and ablation studies are conducted to validate the effectiveness of the proposed distributed fault diagnosis approach. Our ensemble-based fault diagnosis model achieves 97.4% accuracy on a real-world dataset collected by Nanjing Metro in Jiangsu Province, China. Meanwhile, CEC-PA recovers more effectively from node disruptions and achieves speed-ups ranging from 1.98x to 7.93x in total inference time compared to its counterparts.