Experience with Distributed Memory Delaunay-based Image-to-Mesh Conversion Implementation

This paper presents findings on the scalability of parallel 3D mesh generation on distributed memory machines. The primary objective of the study was to evaluate a distributed memory approach to implementing a 3D parallel Delaunay-based image-to-mesh conversion algorithm by leveraging an existing, efficient shared memory implementation. The secondary objective was to evaluate the effectiveness of the development effort (i.e., how much development time can be saved) while introducing minimal overheads, so that the end product, the distributed implementation, retains its parallel efficiency. The distributed algorithm combines two existing and independently developed parallel Delaunay-based methods: (1) a fine-grained method that employs multi-threading and speculative execution on shared memory nodes and (2) a loosely coupled Delaunay-refinement framework for multi-node platforms. The shared memory implementation uses a FIFO work-sharing scheme for thread scheduling, while the distributed memory implementation uses MPI and the Master-Worker (MW) model. The findings from the specific MPI-MW implementation we tested suggest that (1) execution on 40 cores, not necessarily in the same node, is 2.3 times faster than execution on ten cores, and (2) the best speedup is 5.4, obtained with 180 cores, again relative to the best performance on ten cores. A closer look at the performance of the distributed memory and shared memory implementations executing on a single node (40 cores) suggests that the overheads introduced by the MPI-MW implementation are high and render it 4 times slower than the shared memory code on the same number of cores. These findings raise several questions about the potential scalability of a "black box" approach, i.e., re-using a code designed to execute efficiently on shared memory machines without considering its potential use in a distributed memory setting.
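To put the reported speedups in perspective, the short sketch below converts them into parallel efficiency relative to the ten-core baseline. The function name `relative_efficiency` and the choice of metric (speedup divided by the growth in core count) are illustrative assumptions, not part of the paper's methodology; only the core counts and speedups come from the text above.

```python
def relative_efficiency(speedup, cores, baseline_cores=10):
    """Speedup divided by the factor by which the core count grew.

    Hypothetical helper for illustration; 1.0 would mean perfect
    scaling relative to the baseline run.
    """
    return speedup / (cores / baseline_cores)


# Speedups reported in the abstract, both relative to the best
# performance on ten cores.
reported = {40: 2.3, 180: 5.4}

for cores, speedup in reported.items():
    eff = relative_efficiency(speedup, cores)
    print(f"{cores} cores: speedup {speedup}, relative efficiency {eff:.3f}")
```

Under this assumed metric, the 40-core run retains about 0.575 of ideal scaling from the ten-core baseline and the 180-core run about 0.30, which is consistent with the observation that the MPI-MW overheads are high.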