Fully Homomorphic Encryption (FHE) enables the processing of encrypted data without decrypting it. FHE has garnered significant attention over the past decade because it supports secure outsourcing of data processing to remote cloud services. Despite its promise of strong data privacy and security guarantees, FHE introduces a slowdown of up to five orders of magnitude compared with the same computation on plaintext data. This overhead is presently a major barrier to the commercial adoption of FHE. In this work, we leverage GPUs to accelerate FHE, capitalizing on the well-established GPU ecosystem available in the cloud. We propose GME, which extends the current AMD CDNA GPU architecture with three key microarchitectural extensions and a compile-time optimization. First, GME integrates a lightweight on-chip compute unit (CU)-side hierarchical interconnect to retain ciphertext in cache across FHE kernels, thus eliminating redundant memory transactions. Second, to tackle compute bottlenecks, GME introduces special MOD-units that provide native hardware support for modular reduction, one of the most frequently executed operations in FHE. Third, by integrating the MOD-unit with our novel pipelined $64$-bit integer arithmetic cores (WMAC-units), GME further accelerates FHE workloads by $19\%$. Finally, we propose a Locality-Aware Block Scheduler (LABS) that exploits the temporal locality available in FHE primitive blocks. Together, these microarchitectural features and compiler optimizations achieve average speedups of $796\times$, $14.2\times$, and $2.3\times$ over Intel Xeon CPU, NVIDIA V100 GPU, and Xilinx FPGA implementations, respectively.
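To illustrate why modular reduction is a natural target for dedicated hardware, the following minimal sketch shows a software modular multiply built on Barrett reduction. The abstract does not say which reduction algorithm the MOD-unit implements, so Barrett reduction is an assumption chosen for illustration, and the names `barrett_setup`, `barrett_reduce`, and `mod_mul` are hypothetical. The point is only that each reduction expands into several wide multiplies, shifts, and conditional subtractions when no native instruction exists.

```python
def barrett_setup(q):
    """Precompute Barrett constants for modulus q (illustrative helper).

    Returns (k, mu) with k = bit length of q and mu = floor(2^(2k) / q).
    """
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Return x mod q for 0 <= x < q*q without using a division."""
    # Estimate the quotient floor(x / q) with a multiply and a shift.
    t = (x * mu) >> (2 * k)
    r = x - t * q
    # The estimate never exceeds the true quotient, so r >= 0;
    # it may undershoot slightly, so correct by subtraction.
    while r >= q:
        r -= q
    return r


def mod_mul(a, b, q, k, mu):
    """Modular multiply a*b mod q for 0 <= a, b < q."""
    return barrett_reduce(a * b, q, k, mu)


if __name__ == "__main__":
    q = (1 << 61) - 1          # example modulus, chosen arbitrarily
    k, mu = barrett_setup(q)
    a, b = 1234567890123456789 % q, 987654321987654321 % q
    print(mod_mul(a, b, q, k, mu) == (a * b) % q)
```

In this sketch a single modular multiply costs three wide integer multiplications plus shifts and corrective subtractions. That multi-instruction sequence is the kind of work a native modular-reduction unit paired with a wide multiply-accumulate core would plausibly collapse into fewer operations.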