Towards Distributed Semi-speculative Adaptive Anisotropic Parallel Mesh Generation

This paper presents the foundational elements of a distributed memory method for mesh generation designed to exploit the concurrency offered by large-scale computing. To achieve this goal, meshing functionality is separated from performance concerns by assigning each to its own component: CDT3D, a shared memory mesh generation code, handles meshing, and PREMA provides parallel runtime support. Although CDT3D is designed for scalability, lessons are presented on the additional measures taken to integrate the code into the distributed memory method as a black box. In the presented method, an initial mesh is decomposed into subdomains, which are distributed among the nodes of a high-performance computing (HPC) cluster. Meshing operations within CDT3D use a speculative execution model, which enables strict adaptation of each subdomain's interior elements. Interface elements undergo several iterations of shifting so that they are adapted once their data dependencies are resolved. PREMA supports this process by providing asynchronous message passing between encapsulations of data, load balancing, and migration capabilities, all within a globally addressable namespace. PREMA also assists in establishing data dependencies between subdomains, enabling "neighborhoods" of subdomains to perform interface shifts and adaptation independently of one another. Preliminary results show that the presented method produces meshes of quality comparable to those generated by the original shared memory CDT3D code. Given the costly overhead of collective communication observed in existing state-of-the-art software, the relative communication performance of the presented method also suggests that its emphasis on avoiding global synchronization is a potentially viable route to scalability on large core counts.
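The dependency-driven ordering described above can be illustrated with a minimal toy sketch. It is not the CDT3D or PREMA API: the function name, the `neighbors` map, and the rule that an interface becomes ready once both adjacent subdomains have finished their interiors are all assumptions made for illustration, and the sketch does not model interface shifting itself. It shows only that an interface can be processed as soon as its own neighborhood is ready, without waiting on a global barrier.

```python
def adapt_without_global_sync(neighbors, completion_order):
    """Toy model of neighborhood-local scheduling (hypothetical, illustrative only).

    neighbors: dict mapping each subdomain to the set of adjacent subdomains.
    completion_order: order in which subdomains finish adapting their interiors.
    Returns the list of events in the order they occur.
    """
    interior_done = set()
    adapted_interfaces = set()
    events = []
    for sub in completion_order:
        # Interior elements have no cross-subdomain dependencies in this model.
        interior_done.add(sub)
        events.append(("interior", sub))
        # Consult only this subdomain's neighborhood; no global barrier is used.
        for nbr in sorted(neighbors[sub]):
            iface = tuple(sorted((sub, nbr)))
            if nbr in interior_done and iface not in adapted_interfaces:
                adapted_interfaces.add(iface)
                events.append(("interface", iface))
    return events


# Three subdomains in a chain: A - B - C.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
events = adapt_without_global_sync(chain, ["A", "B", "C"])
for event in events:
    print(event)
```

In this run the A-B interface is handled before subdomain C has finished its interior, which is the behavior a global synchronization point would prevent.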
