Homomorphic encryption (HE) allows computations to be carried out directly on ciphertexts, making it indispensable for privacy-preserving cloud computing. Linear transformations are widely used in neural networks, including large language models. However, evaluating a linear transformation over HE requires a large number of ciphertext rotations, which incur significant memory and hardware overhead despite existing simplification techniques. This paper proposes a triple-hoisted baby-step giant-step algorithm that further decomposes the baby step to substantially reduce the number of ciphertext rotations needed for the CKKS HE evaluation of linear transformations. Moreover, to reduce off-chip memory access, which accounts for the majority of the latency, a memory-optimized data path is proposed that partitions the algorithm into multiple phases. Furthermore, an efficient FPGA-based hardware accelerator with an optimized permutation circuit for message routing is designed for the proposed scheme. For a set of typical parameters, the proposed design reduces off-chip memory access by 2.9x compared with the best prior design. Synthesized for Xilinx Virtex UltraScale+ devices, the proposed design achieves a 5.8x reduction in computational latency compared with the baseline design.
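For orientation, the conventional baby-step giant-step evaluation that the proposed algorithm builds on can be sketched on plaintext vectors. This is a minimal illustration of the standard diagonal-based technique only, not the triple-hoisted variant proposed here; Python lists stand in for CKKS ciphertext slots, and the names `rotate`, `diagonal`, and `bsgs_matvec` are hypothetical helpers introduced for this sketch.

```python
def rotate(v, k):
    """Cyclic left rotation by k slots (plaintext stand-in for a ciphertext rotation)."""
    k %= len(v)
    return v[k:] + v[:k]


def diagonal(M, d):
    """d-th generalized diagonal of a square matrix: M[i][(i + d) mod n]."""
    n = len(M)
    return [M[i][(i + d) % n] for i in range(n)]


def bsgs_matvec(M, x, baby):
    """Compute M @ x with the standard baby-step giant-step diagonal method.

    Returns the result and the number of rotations applied to the data
    vector (the operations that would be ciphertext rotations under HE).
    Assumes len(x) is a multiple of `baby`.
    """
    n = len(x)
    assert n % baby == 0
    giant = n // baby
    rotations = 0

    # Baby steps: rotate x by 1..baby-1 once and reuse across all giant steps.
    baby_rots = [x]
    for i in range(1, baby):
        baby_rots.append(rotate(x, i))
        rotations += 1

    y = [0] * n
    for j in range(giant):
        shift = j * baby
        inner = [0] * n
        for i in range(baby):
            # The diagonal is pre-rotated by -shift; in HE this acts on a
            # plaintext and can be done offline, so it is not counted.
            d = rotate(diagonal(M, shift + i), -shift)
            inner = [acc + a * b for acc, a, b in zip(inner, d, baby_rots[i])]
        # Giant step: one rotation of the accumulated inner sum.
        if shift:
            inner = rotate(inner, shift)
            rotations += 1
        y = [a + b for a, b in zip(y, inner)]
    return y, rotations
```

With 16 slots and `baby=4`, this sketch applies 6 data rotations, whereas processing each of the 16 diagonals separately (`baby=1`) applies 15.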