CiFlow: Dataflow Analysis and Optimization of Key Switching for Homomorphic Encryption

Homomorphic encryption (HE) is a privacy-preserving computation technique that enables computation on encrypted data. Today, the potential of HE remains largely unrealized because it is impractically slow, which prevents its use in real applications. A major computational bottleneck in HE is the key-switching operation, which accounts for approximately 70% of overall HE execution time and involves a large amount of data for inputs, intermediates, and keys. Prior research has focused on hardware accelerators to improve HE performance, typically featuring large on-chip SRAMs and high off-chip bandwidth to handle large-scale data. In this paper, we present a novel approach that improves key-switching performance by rigorously analyzing its dataflow. Our primary goal is to optimize data reuse with limited on-chip memory in order to minimize off-chip data movement. We introduce three distinct dataflows: Max-Parallel (MP), Digit-Centric (DC), and Output-Centric (OC), each with a unique approach to scheduling key-switching computations. Through our analysis, we show how the proposed Output-Centric dataflow reuses data effectively by significantly reducing the intermediate key-switching working set, alleviating the need for massive off-chip bandwidth. We thoroughly evaluate the three dataflows using the RPU, a recently published vector processor tailored for ring processing algorithms, including HE. The evaluation sweeps bandwidth and computational throughput and considers whether keys are buffered on-chip or streamed. With OC, we demonstrate up to 4.16x speedup over the MP dataflow and show that OC can reduce on-chip SRAM by 12.25x by streaming keys, with minimal performance penalty.
