Homomorphic encryption (HE) is a privacy-preserving technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized because it is impractically slow for real applications. A major computational bottleneck in HE is the key-switching operation, which accounts for approximately 70% of overall HE execution time and involves large amounts of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to handle large-scale data. In this paper, we present a novel approach to improving key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory in order to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with a unique approach to scheduling key-switching computations. Through our analysis, we show how the proposed OC dataflow effectively reuses data by significantly shrinking the intermediate key-switching working set, alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring-processing algorithms, including HE. The evaluation sweeps bandwidth and computational throughput, and considers whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show that OC can reduce on-chip SRAM by 16x by streaming keys, with minimal performance penalty.