Homomorphic encryption (HE) is a privacy-preserving technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized because it is impractically slow, which prevents its use in real applications. A major computational bottleneck in HE is the key-switching operation, which accounts for approximately 70% of overall HE execution time and involves a large amount of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to handle large-scale data. In this paper, we present a novel approach to improving key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory in order to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with a unique approach to scheduling key-switching computations. Through our analysis, we show how the proposed Output-Centric dataflow can reuse data effectively by significantly shrinking the intermediate key-switching working set, alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, including those used in HE. The evaluation sweeps bandwidth and computational throughput, and considers whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show that OC can reduce on-chip SRAM by 16x by streaming keys, at a minimal performance penalty.