Leveraging ASIC AI Chips for Homomorphic Encryption

Jianming Tong, Tianhao Huang, Jingtian Dang, Leo de Castro, Anirudh Itagi, Anupam Golder, Asra Ali, Jeremy Kun, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna

From arXiv; IEEE International Symposium on High-Performance Computer Architecture (HPCA) 2026; 18 pages, 16 figures, 5 algorithms, 10 tables. Leveraging Google TPUs for Homomorphic Encryption

Homomorphic Encryption (HE) provides strong data privacy for cloud services, but at the cost of prohibitive computational overhead. While GPUs have emerged as a practical platform for accelerating HE, an order-of-magnitude energy-efficiency gap remains compared to specialized (but expensive) HE ASICs. This paper explores an alternative direction: leveraging existing AI accelerators, such as Google's TPUs with their coarse-grained compute and memory architectures, to offer a path toward ASIC-level energy efficiency for HE. However, this architectural paradigm creates a fundamental mismatch with state-of-the-art (SotA) HE algorithms designed for GPUs. These algorithms rely heavily on: (1) high-precision (32-bit) integer arithmetic, which must run on the TPU's low-throughput vector unit and leaves its high-throughput, low-precision (8-bit) matrix engine (MXU) idle, and (2) fine-grained data permutations, which are inefficient on the TPU's coarse-grained memory subsystem. Consequently, porting GPU-optimized HE libraries to TPUs results in severe resource under-utilization and performance degradation. To tackle these challenges, we introduce CROSS, a compiler framework that systematically transforms HE workloads to align with the TPU's architecture. CROSS makes two key contributions: (1) Basis-Aligned Transformation (BAT), a novel technique that converts high-precision modular arithmetic into dense, low-precision (INT8) matrix multiplications, unlocking the TPU's MXU for HE and improving its utilization, and (2) Memory-Aligned Transformation (MAT), which eliminates costly runtime data reordering by embedding the reordering into compute kernels through offline parameter transformation. CROSS (TPU v6e) achieves higher throughput per watt on NTT and HE operators than WarpDrive, FIDESlib, FAB, HEAP, and Cheddar, establishing AI ASICs as the SotA efficient platform for HE operators. Code: https://github.com/EfficientPPML/CROSS
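The two transformations named in the abstract can be illustrated with a minimal Python sketch. It is not the CROSS implementation: every function name here is hypothetical, the limbs are unsigned 8-bit values for simplicity, and the moduli and inputs are toy choices. The first half shows one way a 32-bit modular multiplication by a known constant `w` can become a product of 8-bit limbs, by folding the powers of the limb basis into a precomputed matrix. The second half shows how an output permutation (bit reversal, assumed here as the example reordering) can be folded offline into the rows of an NTT matrix, so that no reordering happens at runtime.

```python
# Sketch only: helper names are hypothetical and not taken from the CROSS codebase.
LIMB_BITS = 8
NUM_LIMBS = 4  # a 32-bit word split into four 8-bit limbs


def to_limbs(x):
    """Split a value below 2**32 into little-endian 8-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & 0xFF for i in range(NUM_LIMBS)]


def precompute_limb_matrix(w, q):
    """Offline step: row i holds the limbs of (w * 2**(8*i)) mod q."""
    return [to_limbs((w << (LIMB_BITS * i)) % q) for i in range(NUM_LIMBS)]


def mulmod_via_limbs(x, limb_matrix, q):
    """Runtime step: (w * x) mod q as a limb-vector times limb-matrix product."""
    x_limbs = to_limbs(x)
    # Every product is 8-bit by 8-bit; each column sum stays below 2**18.
    acc = [sum(x_limbs[i] * limb_matrix[i][j] for i in range(NUM_LIMBS))
           for j in range(NUM_LIMBS)]
    # Recombine the column sums and apply one final reduction.
    return sum(a << (LIMB_BITS * j) for j, a in enumerate(acc)) % q


def matvec_mod(matrix, vec, q):
    """Dense matrix-vector product with each entry reduced mod q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in matrix]


def bit_reverse(k, bits):
    """Reverse the lowest `bits` bits of k."""
    return int(format(k, f"0{bits}b")[::-1], 2)


def fold_permutation(matrix, perm):
    """Offline step: reorder rows so the product already comes out permuted."""
    return [matrix[p] for p in perm]


# Part 1: limb-based modular multiplication (toy modulus and operands).
Q = 2**31 - 1
W, X = 123456789, 987654321
limb_matrix = precompute_limb_matrix(W, Q)
limb_result = mulmod_via_limbs(X, limb_matrix, Q)

# Part 2: fold a bit-reversal into a tiny NTT matrix (n = 8, q = 17, omega = 2).
N, Q_SMALL, OMEGA = 8, 17, 2
ntt_matrix = [[pow(OMEGA, k * j, Q_SMALL) for j in range(N)] for k in range(N)]
perm = [bit_reverse(k, 3) for k in range(N)]
data = [1, 2, 3, 4, 5, 6, 7, 8]
plain = matvec_mod(ntt_matrix, data, Q_SMALL)
reordered_at_runtime = [plain[p] for p in perm]
reordered_offline = matvec_mod(fold_permutation(ntt_matrix, perm), data, Q_SMALL)
```

In this sketch `limb_result` equals `(W * X) % Q`, and `reordered_offline` equals `reordered_at_runtime`. The limb matrix and the permuted NTT matrix both depend only on parameters known ahead of time, which is why they can be prepared offline; only the small-integer products and one final reduction remain at runtime.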
