Fully homomorphic encryption (FHE) enables secure computation on encrypted data, mitigating privacy concerns in cloud and edge environments. However, its high compute and memory demands have motivated extensive acceleration research across diverse hardware platforms, especially GPUs. In this paper, we perform a microarchitectural analysis of CKKS, a popular FHE scheme, on modern GPUs. Focusing on the memory hierarchy, we demonstrate that the dominant kernels remain bound by the on-chip L2 cache despite its high bandwidth, exposing a persistent inner memory wall beyond the conventional off-chip DRAM bottleneck. We further show that overall CKKS throughput is constrained by low per-kernel hardware utilization, caused by insufficient intra-kernel parallelism. Motivated by these findings, we introduce Theodosian, a set of complementary, memory-aware optimizations that improve cache efficiency and reduce runtime overheads. Theodosian achieves 1.45--1.83x performance improvements over Cheddar, a highly optimized baseline, across representative CKKS workloads. On an RTX 5090, we reduce the bootstrapping latency for 32,768 complex numbers from 22.1 ms to 15.2 ms, and further to 12.8 ms with additional algorithmic optimizations, which, to the best of our knowledge, sets a new state of the art for GPU performance.