The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, although several options are available, no specific guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic \HL{neuroimaging} applications to process the \SI{606}{\gibi\byte} BigBrain image, and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system, varying the number of nodes, the file size, and the task duration. Our results show that, although there are slight differences between Dask and Spark, the performance of the two engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.