Homomorphic encryption (HE) is a privacy-preserving technique that enables computation directly on encrypted data. Today, the potential of HE remains largely unrealized because it is impractically slow, which prevents its use in real applications. A major computational bottleneck in HE is the key-switching operation, which accounts for approximately 70% of overall HE execution time and involves large amounts of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to handle large-scale data. In this paper, we present a novel approach to improving key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory in order to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with a unique approach to scheduling key-switching computations. Through our analysis, we show how the proposed Output-Centric dataflow reuses data effectively by significantly shrinking the intermediate key-switching working set, alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, which include HE. The evaluation sweeps bandwidth and computational throughput, and considers whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show that OC can reduce on-chip SRAM by 12.25x by streaming keys, with minimal performance penalty.