Locality-driven integration is a pervasive computational pattern in quantum chemistry, arising whenever spatially localized basis functions interact through numerical quadrature or integral screening. The dominant matrix multiplications in these tasks exhibit dynamic, structured sparsity driven by spatial locality, posing significant challenges for both dense batched kernels and generic sparse formats on GPUs. We present KerneLDI, a GPU-oriented framework that addresses this regime by co-designing data layout, screening logic, and matrix-computation operators to realize block-structured matrix multiplication for locality-driven integration. KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks, and executes the resulting contractions with customized dense block multipliers that adapt proven dense-matmul optimizations to retained block pairs. We develop and evaluate KerneLDI on exchange--correlation (EXC) integration in Kohn--Sham density functional theory, a representative and computationally critical instance of this pattern. Across diverse molecular systems, KerneLDI preserves numerical accuracy while delivering up to a 10$\times$ speedup for EXC evaluation over a dense GPU baseline, scales favorably with increasing system size and multi-GPU parallelism, accelerates end-to-end self-consistent field calculations, and yields a nearly 6$\times$ throughput improvement for ab initio molecular dynamics.
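To make the block-filtered idea concrete, the following is a minimal CPU sketch in NumPy, not KerneLDI's actual GPU implementation. It assumes a quadrature-style contraction of the form $V = \Phi^{T}\,\mathrm{diag}(w)\,\Phi$, where $\Phi$ holds basis-function values on grid points and $w$ holds quadrature weights. The function names `block_filter` and `block_filtered_contract`, the max-absolute-value screening rule, the tile sizes, and the 1D Gaussian toy system are all hypothetical choices made for illustration.

```python
import numpy as np


def block_filter(phi, grid_block, basis_block, tol):
    """Map each grid-block index to the basis-block indices worth keeping.

    A (grid block, basis block) tile is kept when its largest absolute
    entry exceeds `tol` (a hypothetical screening rule for this sketch).
    """
    n_grid, n_basis = phi.shape
    kept = {}
    for gi, g0 in enumerate(range(0, n_grid, grid_block)):
        rows = phi[g0:g0 + grid_block]
        kept[gi] = [
            bi
            for bi, b0 in enumerate(range(0, n_basis, basis_block))
            if np.abs(rows[:, b0:b0 + basis_block]).max() > tol
        ]
    return kept


def block_filtered_contract(phi, weights, kept, grid_block, basis_block):
    """Accumulate V = phi^T diag(weights) phi over retained tile pairs only."""
    n_grid, n_basis = phi.shape
    out = np.zeros((n_basis, n_basis))
    for gi, blocks in kept.items():
        g = slice(gi * grid_block, min((gi + 1) * grid_block, n_grid))
        w = weights[g][:, None]
        for bi in blocks:
            ci = slice(bi * basis_block, min((bi + 1) * basis_block, n_basis))
            left = (phi[g, ci] * w).T  # small dense tile with weights folded in
            for bj in blocks:
                cj = slice(bj * basis_block, min((bj + 1) * basis_block, n_basis))
                out[ci, cj] += left @ phi[g, cj]  # dense block multiply
    return out


# Toy 1D system: Gaussian basis functions on a uniform quadrature grid.
grid = np.linspace(0.0, 20.0, 400)
centers = np.linspace(0.5, 19.5, 32)
phi = np.exp(-5.0 * (grid[:, None] - centers[None, :]) ** 2)
weights = np.full(grid.size, grid[1] - grid[0])

kept = block_filter(phi, grid_block=50, basis_block=4, tol=1e-10)
v_blocked = block_filtered_contract(phi, weights, kept, 50, 4)
v_dense = phi.T @ (phi * weights[:, None])  # dense reference result

n_kept = sum(len(b) for b in kept.values())
n_total = len(kept) * (centers.size // 4)
print(f"retained {n_kept} of {n_total} tiles")
print("max abs deviation from dense:", np.abs(v_blocked - v_dense).max())
```

In this toy setting, each grid block interacts with only the few basis blocks whose Gaussians overlap it, so most tiles are skipped while the retained ones are still multiplied as small dense matrices. A GPU version would presumably replace the Python loops with batched or fused tile kernels, which this sketch does not attempt to model.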