Recent advancements in detector technology have significantly increased the size and complexity of experimental data, and high-performance computing (HPC) provides a path towards more efficient and timely data processing. However, moving large data sets from acquisition systems to HPC centers introduces bottlenecks owing to storage I/O at both ends. This manuscript introduces a streaming workflow designed for a high-data-rate electron detector that streams data directly to compute node memory at the National Energy Research Scientific Computing Center (NERSC), thereby avoiding storage I/O. The new workflow deploys ZeroMQ-based services for data production, aggregation, and distribution for on-the-fly processing, all coordinated through a distributed key-value store. The system is integrated with the detector's science gateway and utilizes the NERSC Superfacility API to initiate streaming jobs through a web-based frontend. Our approach achieves up to a 14-fold increase in data throughput and enhances predictability and reliability compared to an I/O-heavy file-based transfer workflow. Our work highlights the transformative potential of streaming workflows to expedite data analysis for time-sensitive experiments.
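The producer-to-aggregator path described above can be illustrated with a minimal ZeroMQ PUSH/PULL sketch. This is not the manuscript's implementation; the endpoint name, frame count, and message layout (frame index plus raw payload) are illustrative assumptions. The aggregator binds a PULL socket and a producer thread connects with PUSH, mimicking detector frames streamed directly into a consumer's memory without touching storage.

```python
import threading
import zmq  # pyzmq; third-party dependency

ENDPOINT = "inproc://detector-frames"  # hypothetical endpoint name
N_FRAMES = 4

ctx = zmq.Context()

# Aggregator side: bind the PULL socket first so producers can connect.
pull = ctx.socket(zmq.PULL)
pull.bind(ENDPOINT)

def producer():
    """Stream dummy detector frames to the aggregator over PUSH."""
    push = ctx.socket(zmq.PUSH)
    push.connect(ENDPOINT)
    for i in range(N_FRAMES):
        # Each message: 4-byte frame index + raw payload (dummy bytes here).
        push.send_multipart([i.to_bytes(4, "big"), b"\x00" * 16])
    push.close()

t = threading.Thread(target=producer)
t.start()

# Aggregator receives frames directly into memory; no file I/O involved.
received = [int.from_bytes(pull.recv_multipart()[0], "big")
            for _ in range(N_FRAMES)]

t.join()
pull.close()
ctx.term()
print(received)  # frames arrive in order from a single producer: [0, 1, 2, 3]
```

With a single producer, PUSH/PULL delivers messages in send order; in a multi-producer deployment the frame index carried in each message would let the aggregator reassemble ordering, which is one reason to tag frames explicitly rather than rely on arrival order.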