In this paper, we present the first multi-modal FHE accelerator based on a unified architecture, which efficiently supports CKKS, TFHE, and their conversion scheme within a single accelerator. To achieve this goal, we first analyze the theoretical foundations of these schemes and show that they are composed of a finite set of arithmetic kernels. We then investigate the challenges of efficiently supporting these kernels within a unified architecture: 1) concurrent support for NTT and FFT, 2) maintaining high hardware utilization across various polynomial lengths, and 3) ensuring consistent performance across diverse arithmetic kernels. To tackle these challenges, we propose a novel FHE accelerator named Trinity, which combines algorithm optimizations, hardware component reuse, and dynamic workload scheduling to accelerate CKKS, TFHE, and their conversion scheme. By adaptively selecting the allocation of components for NTT and MAC, Trinity maintains high utilization across NTTs with various polynomial lengths and under imbalanced arithmetic workloads. Experimental results show that, for pure CKKS and TFHE workloads, Trinity outperforms the state-of-the-art accelerators for CKKS (SHARP) and TFHE (Morphling) by 1.49x and 4.23x, respectively. Moreover, Trinity achieves a 919.3x performance improvement for the FHE-conversion scheme over the CPU-based implementation. Notably, despite these performance gains, the circuit area of Trinity is only 85% of the combined circuit areas of SHARP and Morphling.