Matryoshka: Optimization of Dynamic Diverse Quantum Chemistry Systems via Elastic Parallelism Transformation

AI infrastructures, predominantly GPUs, have delivered remarkable performance gains for deep learning. In contrast, scientific computing, exemplified by quantum chemistry systems, suffers from dynamic diversity: its computational patterns are more diverse and vary dynamically, which makes it difficult to exploit GPU acceleration. In this paper, we propose Matryoshka, a novel elastically parallel technique for the efficient execution of quantum chemistry systems with dynamic diversity on GPUs. Matryoshka capitalizes on Elastic Parallelism Transformation, a property that is prevalent in scientific systems yet underexplored for dynamic diversity, to elastically realign parallel patterns with the GPU architecture. Structured around three transformation primitives (Permutation, Deconstruction, and Combination), Matryoshka comprises three core components. The Block Constructor serves as the central orchestrator: it reformulates data structures to accommodate dynamic inputs and constructs fine-grained, GPU-efficient compute blocks. Within each compute block, the Graph Compiler operates offline, generating high-performance code with a clear computational path through an automated compilation process. The Workload Allocator operates online, dynamically scheduling workloads with varying operational intensities onto threads. The allocator achieves highly efficient parallelism for compute-intensive operations and automatically fuses them with neighboring memory-intensive operations. Extensive evaluation shows that Matryoshka effectively addresses dynamic diversity, yielding speedups of up to 13.86x (9.41x on average) over prevailing state-of-the-art approaches on 13 quantum chemistry systems.
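
To make the notion of "varying operational intensities" concrete, the sketch below classifies operations by operational intensity (FLOPs per byte moved) and attaches each memory-intensive operation to the preceding compute-intensive one as a fusion candidate. It is a minimal illustration under stated assumptions, not Matryoshka's Workload Allocator: the names `Op`, `classify`, and `fuse_groups`, the ridge value of 10 FLOPs/byte, and all workload numbers are hypothetical, and the greedy grouping rule is a stand-in for whatever policy the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """A hypothetical operation with its cost estimates."""
    name: str
    flops: float        # floating-point operations performed
    bytes_moved: float  # bytes read from and written to device memory

    @property
    def intensity(self) -> float:
        # Operational intensity in FLOPs per byte.
        return self.flops / self.bytes_moved


def classify(op: Op, ridge: float) -> str:
    """Label an op compute- or memory-intensive relative to a ridge point."""
    return "compute" if op.intensity >= ridge else "memory"


def fuse_groups(ops: list[Op], ridge: float) -> list[list[str]]:
    """Greedily attach memory-intensive ops to the preceding compute-intensive op.

    Each compute-intensive op opens a new group; memory-intensive ops that
    follow it join that group as fusion candidates.
    """
    groups: list[list[str]] = []
    for op in ops:
        if classify(op, ridge) == "compute" or not groups:
            groups.append([op.name])
        else:
            groups[-1].append(op.name)
    return groups


# Illustrative numbers only (assumed, not measured).
ops = [
    Op("contract_a", flops=2e9, bytes_moved=1e7),    # 200 FLOPs/byte
    Op("scale", flops=1e6, bytes_moved=8e6),         # 0.125 FLOPs/byte
    Op("accumulate", flops=1e6, bytes_moved=1.2e7),  # about 0.083 FLOPs/byte
    Op("contract_b", flops=4e9, bytes_moved=2e7),    # 200 FLOPs/byte
    Op("store", flops=0.0, bytes_moved=8e6),         # 0 FLOPs/byte
]
groups = fuse_groups(ops, ridge=10.0)
print(groups)
```

With these assumed numbers, the two contractions each open a group and the element-wise operations that follow them are marked for fusion. In practice the ridge point would depend on the target GPU's ratio of peak compute throughput to memory bandwidth.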
