The computation of electron repulsion integrals (ERIs) over Gaussian-type orbitals (GTOs) is a challenging problem in quantum-mechanics-based atomistic simulations. In practical simulations, several trillion ERIs may have to be computed for every time step. In this work, we investigate FPGAs as accelerators for the ERI computation. We use template parameters, here within the Intel oneAPI tool flow, to create customized designs for 256 different ERI quartet classes, based on their orbitals. To maximize data reuse, all intermediates are buffered in FPGA on-chip memory with a customized layout. The pre-calculation of intermediates also helps to overcome data dependencies caused by multi-dimensional recurrence relations. The loop structures involved are partially or even fully unrolled to achieve high throughput in the FPGA kernels. Furthermore, a lossy compression algorithm utilizing arbitrary-bitwidth integers is integrated into the FPGA kernels. To the best of our knowledge, this is the first work on ERI computation on FPGAs that supports more than just the single most basic quartet class. Also, the integration of ERI computation and compression is a novelty that is not even covered by CPU or GPU libraries so far. Our evaluation shows that, using 16-bit integers for the ERI compression, the fastest FPGA kernels exceed a performance of 10 GERIS ($10 \times 10^9$ ERIs per second) on one Intel Stratix 10 GX 2800 FPGA, with maximum absolute errors of around $10^{-7}$ to $10^{-5}$ Hartree. The measured throughput can be accurately explained by a performance model. The FPGA kernels deployed on 2 FPGAs outperform similar computations using the widely used libint reference on a two-socket server with 40 Xeon Gold 6148 CPU cores of the same process technology by factors of up to 6.0x, and on a newer two-socket server with 128 EPYC 7713 CPU cores by factors of up to 1.9x.
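As a rough illustration of two ideas mentioned in the abstract, the following C++ sketch shows (a) how template parameters can select one design per quartet class and (b) how a value can be lossily compressed to an integer of a chosen bit width. It is not the authors' implementation. All names (`ncart`, `QuartetClass`, `num_classes`, `quantize`, `dequantize`) are hypothetical, the count of 256 classes is assumed here to come from four angular momenta each ranging over four values (an inference the abstract does not state), and the uniform scaling by a known magnitude bound `max_abs` is an assumed, simple quantization scheme that may differ from the compression algorithm used in the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: number of Cartesian components of a shell
// with angular momentum L (s=1, p=3, d=6, f=10).
constexpr int ncart(int L) { return (L + 1) * (L + 2) / 2; }

// Hypothetical quartet-class descriptor: one template instantiation
// per combination (La, Lb, Lc, Ld), so sizes are known at compile time.
template <int La, int Lb, int Lc, int Ld>
struct QuartetClass {
    static constexpr int size =
        ncart(La) * ncart(Lb) * ncart(Lc) * ncart(Ld);
};

// Assumption: each of the four angular momenta ranges over 0..max_l.
constexpr int num_classes(int max_l) {
    const int n = max_l + 1;
    return n * n * n * n;
}
static_assert(num_classes(3) == 256, "4^4 quartet classes for s,p,d,f");

// Assumed scheme: uniform quantization of x in [-max_abs, max_abs]
// to a signed integer that fits into Bits bits.
template <int Bits>
std::int32_t quantize(double x, double max_abs) {
    static_assert(Bits >= 2 && Bits <= 31, "unsupported bit width");
    assert(max_abs > 0.0);
    const double levels = static_cast<double>((1 << (Bits - 1)) - 1);
    double q = std::round(x / max_abs * levels);
    if (q > levels) q = levels;    // clamp values above the bound
    if (q < -levels) q = -levels;  // clamp values below the bound
    return static_cast<std::int32_t>(q);
}

// Inverse mapping; the result differs from the input by at most
// half a quantization step for inputs inside [-max_abs, max_abs].
template <int Bits>
double dequantize(std::int32_t q, double max_abs) {
    const double levels = static_cast<double>((1 << (Bits - 1)) - 1);
    return static_cast<double>(q) * max_abs / levels;
}
```

Under these assumptions, a round trip through `quantize<16>` and `dequantize<16>` changes an in-range value by at most `0.5 * max_abs / 32767`, so the absolute error scales with the magnitude bound chosen for the block of integrals being compressed.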