HE^2: A Communication-Light Heterogeneous Architecture for Efficient Fully Homomorphic Encryption

CKKS, an emerging fully homomorphic encryption (FHE) scheme, has shown promise for privacy-preserving applications because it enables SIMD fixed-point computation on ciphertexts. Despite its strong security guarantees, CKKS involves both compute-intensive operators (ComOps) with high computational cost and memory-intensive operators (MemOps) with large memory footprints, so existing ASIC-based and near-memory-processing (NMP)-based accelerators suffer from high hardware overhead and limited efficiency. This observation motivates combining the architectural advantages of both paradigms in a heterogeneous xPU (ASIC)-xMU (NMP) architecture. In such a design, however, the frequent, long-latency heterogeneous communication caused by the dominant keyswitch operator remains a key performance bottleneck. In this paper, we propose $HE^2$, a communication-light xPU-xMU heterogeneous FHE accelerator built on dataflow graph (DFG) optimization and architecture co-design. First, we observe that most communication arises at the interface between ModUp/ModDown and the neighboring MemOps. To address this, we propose a DFG-level optimization framework that fully exploits the ModUp/ModDown reduction potential of the hoisting algorithm by identifying parallel keyswitch blocks and fusing them, which reduces communication frequency. Second, we design an efficient heterogeneous architecture that adopts group-level pipelined execution, hiding communication latency by leveraging the inherent parallelism across decomposed groups. End-to-end evaluation shows that $HE^2$ achieves a 1.66$\times$ speedup and 9.23$\times$ lower energy-delay-area product (EDAP) than the state-of-the-art accelerator, with communication stalls accounting for only 6.67% of total latency.
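To make the latency-hiding idea concrete, the following is a minimal toy model of how pipelining across groups can overlap transfers with computation. It is an illustrative sketch only, not the $HE^2$ architecture: the two-stage structure (one link, one compute unit), the function names, and the per-group timings are all assumptions chosen for the example.

```python
# Toy two-stage pipeline model (hypothetical, for illustration only).
# Stage 1: transfer a group's data over the heterogeneous link.
# Stage 2: compute on that group once its data has arrived.

def serial_latency(comm, comp):
    """No overlap: every transfer and every computation runs back to back."""
    return sum(comm) + sum(comp)

def pipelined_latency(comm, comp):
    """Overlap: the transfer of a later group runs while an earlier group computes."""
    comm_done = 0.0  # time at which the link finishes the current group's transfer
    comp_done = 0.0  # time at which the compute unit finishes the current group
    for c, p in zip(comm, comp):
        comm_done += c  # transfers are issued back to back on the link
        # Compute starts when both the data and the compute unit are available.
        comp_done = max(comp_done, comm_done) + p
    return comp_done

def stall_fraction(comm, comp):
    """Share of pipelined latency during which the compute unit sits idle."""
    total = pipelined_latency(comm, comp)
    return (total - sum(comp)) / total

# Assumed timings: 4 groups, 1 time unit to transfer, 3 time units to compute.
comm_times = [1.0] * 4
comp_times = [3.0] * 4
print(serial_latency(comm_times, comp_times))     # 16.0
print(pipelined_latency(comm_times, comp_times))  # 13.0
```

In this model only the first transfer is exposed when computation per group takes longer than its transfer; the remaining transfers are hidden behind computation on earlier groups.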