Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantees, it requires substantially more resources than computing on plaintext, often leading to unacceptably large result latencies. HE accelerators have emerged to mitigate this latency issue, but at the high cost of custom ASICs. In this paper, we show that HE primitives can be converted into AI operators and accelerated on existing ASIC AI accelerators, such as TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) implementing high-precision arithmetic in software, and (3) mapping operations efficiently onto matrix engines. We introduce the CROSS compiler, which (1) adopts Barrett reduction to provide modular reduction support using only multipliers and adders, (2) applies Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and (3) applies Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplications that can be processed efficiently on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedups over previous work on many-core CPUs and the V100 GPU, respectively. The kernel-level code is open-sourced at https://github.com/google/jaxite/tree/main/jaxite_word.
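The Barrett reduction mentioned above is a standard technique for computing a modular remainder without a hardware divider, using one precomputed constant plus multiplies, shifts, and a conditional subtraction. A minimal Python sketch of the textbook algorithm follows; the function and variable names are illustrative and not taken from the CROSS codebase:

```python
def barrett_reduce(x, q, k=None):
    """Compute x mod q via Barrett reduction (no division at runtime).

    Valid for 0 <= x < q**2 with q < 2**k: the quotient estimate t is
    off by at most one, so a single conditional subtraction suffices.
    """
    if k is None:
        k = q.bit_length()
    m = (1 << (2 * k)) // q       # precomputed once per modulus q
    t = (x * m) >> (2 * k)        # estimate of floor(x / q)
    r = x - t * q                 # guaranteed to lie in [0, 2q)
    return r - q if r >= q else r
```

In hardware terms, the division by `q` happens only once at compile time (computing `m`); each runtime reduction is two multiplies, a shift, and a subtract, which is why it maps onto multiplier/adder datapaths.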
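The core idea behind expressing high-precision multiplication as low-precision matrix-vector multiplication can be illustrated with a toy sketch: split each big integer into w-bit limbs, so that the product becomes a limb-wise convolution, which is exactly a Toeplitz-matrix-vector product. This is a generic radix-decomposition illustration under our own naming, not the paper's BAT implementation:

```python
def to_limbs(x, w, n):
    # Split integer x into n little-endian limbs of w bits each.
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(n)]

def limb_matvec_multiply(a, b, w=8, n=4):
    """Toy high-precision multiply via a low-precision matrix-vector product.

    The limb convolution a*b is T @ A, where T is the Toeplitz matrix
    built from b's limbs -- the kind of operand a 2D matrix engine consumes.
    """
    A = to_limbs(a, w, n)
    B = to_limbs(b, w, n)
    # Toeplitz matrix: T[i][j] = B[i-j], zero outside the valid range.
    T = [[B[i - j] if 0 <= i - j < n else 0 for j in range(n)]
         for i in range(2 * n - 1)]
    conv = [sum(T[i][j] * A[j] for j in range(n)) for i in range(2 * n - 1)]
    # Fold carries by shift-and-add to recover the full-precision product.
    return sum(c << (w * i) for i, c in enumerate(conv))
```

Each entry of `T` and `A` fits in w bits, so the matrix engine only ever sees low-precision operands; the wide result is reassembled afterwards by the shift-and-add carry fold.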