Performance comparison of Dask and Apache Spark on HPC systems for Neuroimaging

The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GiB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system, varying the number of nodes, the file size, and the task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.


