Recent advancements in detector technology have significantly increased the size and complexity of experimental data, and high-performance computing (HPC) provides a path towards more efficient and timely data processing. However, moving large data sets from acquisition systems to HPC centers introduces bottlenecks owing to storage I/O at both ends. This manuscript introduces a streaming workflow designed for a high-data-rate electron detector that streams data directly to compute node memory at the National Energy Research Scientific Computing Center (NERSC), thereby avoiding storage I/O. The new workflow deploys ZeroMQ-based services for data production, aggregation, and distribution for on-the-fly processing, all coordinated through a distributed key-value store. The system is integrated with the detector's science gateway and utilizes the NERSC Superfacility API to initiate streaming jobs through a web-based frontend. Our approach achieves up to a 14-fold increase in data throughput and enhances predictability and reliability compared to an I/O-heavy, file-based transfer workflow. Our work highlights the transformative potential of streaming workflows to expedite data analysis for time-sensitive experiments.
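The producer-to-worker streaming pattern the abstract describes can be sketched with ZeroMQ's PUSH/PULL sockets. This is a minimal illustration, not the authors' implementation: the socket names, the `inproc://frames` endpoint, and the `stream_frames` helper are hypothetical, and a real deployment would use `tcp://` transport between the detector and compute nodes.

```python
# Minimal sketch of a ZeroMQ PUSH/PULL pipeline in the spirit of the
# workflow described above: a detector-side producer pushes frames and a
# compute-node worker receives them directly in memory, with no file I/O.
import zmq

ctx = zmq.Context.instance()

# "Producer" socket: the data-acquisition side.
producer = ctx.socket(zmq.PUSH)
producer.bind("inproc://frames")  # in-process transport for this demo only

# "Worker" socket: the compute-node side, pulling frames into memory.
worker = ctx.socket(zmq.PULL)
worker.connect("inproc://frames")

def stream_frames(frames):
    """Push raw frame bytes; across hosts this endpoint would be tcp://."""
    for frame in frames:
        producer.send(frame)

stream_frames([b"frame-0", b"frame-1"])
received = [worker.recv() for _ in range(2)]
print(received)
```

With multiple PULL workers connected to the same PUSH socket, ZeroMQ load-balances frames across them, which is one common way to fan work out for on-the-fly processing.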