Locality-driven integration is a pervasive computational pattern in quantum chemistry, arising whenever spatially localized basis functions interact through numerical quadrature or integral screening. The dominant matrix multiplications in these tasks exhibit dynamic, structured sparsity driven by spatial locality, which poses significant challenges for both dense batched kernels and generic sparse formats on GPUs. We present KerneLDI, a GPU-oriented framework that addresses this regime by co-designing data layout, screening logic, and matrix-computation operators to realize block-structured matrix multiplication for locality-driven integration. KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks, and executes the resulting contractions with customized dense block multipliers that apply proven dense-matmul optimizations to the retained block pairs. We develop and evaluate KerneLDI on exchange--correlation (EXC) integration in Kohn--Sham density functional theory, a representative and computationally critical instance of this pattern. Across diverse molecular systems, KerneLDI preserves numerical accuracy while delivering up to a 10$\times$ speedup for EXC evaluation over a dense GPU baseline, scales favorably with increasing system size and multi-GPU parallelism, accelerates end-to-end self-consistent field calculations, and yields a nearly 6$\times$ throughput improvement for ab initio molecular dynamics.
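To make the block-filtered idea concrete, the following is a minimal CPU sketch in NumPy, not KerneLDI's GPU implementation. It assumes the EXC matrix can be written as $V = \Phi^{T}\,\mathrm{diag}(w v)\,\Phi$, where $\Phi$ holds basis-function values on grid points and $w v$ is the product of quadrature weights and the potential. The function name `block_filtered_xc`, the block sizes, the screening threshold `tol`, and the 1D Gaussian test system are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def block_filtered_xc(phi, wv, grid_bs=50, basis_bs=4, tol=1e-10):
    """Accumulate V = phi.T @ diag(wv) @ phi using only retained blocks.

    phi: (n_grid, n_basis) basis values on grid points (hypothetical layout).
    wv:  (n_grid,) quadrature weight times potential at each grid point.
    Returns V plus the number of block pairs multiplied and the dense total.
    """
    n_grid, n_basis = phi.shape
    V = np.zeros((n_basis, n_basis))
    basis_blocks = [(b0, min(b0 + basis_bs, n_basis))
                    for b0 in range(0, n_basis, basis_bs)]
    kept_pairs = 0
    total_pairs = 0
    for g0 in range(0, n_grid, grid_bs):
        g1 = min(g0 + grid_bs, n_grid)
        tile = phi[g0:g1]
        total_pairs += len(basis_blocks) ** 2
        # Screening: keep a basis block only if it is non-negligible
        # somewhere on this grid block.
        kept = [(b0, b1) for (b0, b1) in basis_blocks
                if np.abs(tile[:, b0:b1]).max() > tol]
        if not kept:
            continue
        weighted = tile * wv[g0:g1, None]
        # Small dense multiplications over retained block pairs only.
        for (i0, i1) in kept:
            for (j0, j1) in kept:
                V[i0:i1, j0:j1] += tile[:, i0:i1].T @ weighted[:, j0:j1]
                kept_pairs += 1
    return V, kept_pairs, total_pairs

# Illustrative test system: localized 1D Gaussians on a uniform grid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 40.0, 400)
centers = np.linspace(0.0, 40.0, 32)
phi = np.exp(-2.0 * (x[:, None] - centers[None, :]) ** 2)
wv = rng.uniform(0.5, 1.5, size=x.size)

V, kept_pairs, total_pairs = block_filtered_xc(phi, wv)
V_ref = phi.T @ (wv[:, None] * phi)  # dense reference
print("max abs error:", np.abs(V - V_ref).max())
print("block pairs multiplied:", kept_pairs, "of", total_pairs)
```

In this sketch the screening decision is made once per grid block and reused for both operands, which loosely mirrors the idea of a unified block-filtered representation. A GPU version would presumably batch the retained block pairs rather than loop over them in Python.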