We study the performance of a cloud-based GPU-accelerated inference server used to speed up event reconstruction in neutrino data batch jobs. Using detector data from the ProtoDUNE experiment and the standard DUNE grid job submission tools, we attempt to reprocess the data by running several thousand concurrent grid jobs, a scale we expect to be typical of current and future neutrino physics experiments. We process most of the dataset with the GPU version of our processing algorithm and the remainder with the CPU version for timing comparisons. We find that a 100-GPU cloud-based server easily meets the processing demand, and that the GPU version of the event processing algorithm is twice as fast as the CPU version when compared against the newest CPUs in our sample. However, the amount of data transferred to the inference server during the GPU runs can overwhelm even the highest-bandwidth network switches unless care is taken to observe network facility limits or to distribute the jobs across multiple sites. We discuss the lessons learned from this processing campaign and several avenues for future improvement.