Fully Homomorphic Encryption (FHE) allows computation directly on encrypted data and enables privacy-preserving neural inference in the cloud. Prior work has focused on models with dense inputs (e.g., CNNs), with less attention given to models with sparse inputs such as Deep Learning Recommendation Models (DLRMs). These models require encrypted lookups into large embedding tables, which are challenging to implement with FHE's restrictive operators and introduce significant overhead. In this paper, we develop performance optimizations to efficiently support embedding lookups in FHE-based inference pipelines. First, we present an embedding compression technique using client-side digit decomposition that achieves a 56$\times$ speedup over the state of the art. Next, we propose a multi-embedding packing strategy that enables ciphertext SIMD-parallel lookups across multiple tables. Crucially, our goal is not only to retrieve the correct embeddings, but also to produce ciphertext outputs in a layout that is directly compatible with downstream encrypted computations on the server. We name our approach HE-LRM and demonstrate end-to-end encrypted DLRM inference. We evaluate HE-LRM on UCI (health prediction) and Criteo (click prediction), achieving inference latencies of 24 and 489 seconds, respectively, on a single-threaded CPU. Finally, while our evaluation focuses on DLRMs, we also investigate applying our embedding-lookup primitives to other models such as LLMs, which require both batched and single-embedding lookups.
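To make the idea of client-side digit decomposition concrete, the following is a minimal plaintext sketch in Python with NumPy. It is not the HE-LRM implementation: the function names, the choice of base, and the outer-product expansion are illustrative assumptions, and no encryption is performed. The sketch only shows why sending one short one-hot vector per digit (a total of `base * num_digits` entries) can stand in for a single one-hot vector over the whole table (`base ** num_digits` entries). In an FHE setting, the products in `expand_selector` would presumably be ciphertext multiplications and the final product with the table a plaintext-ciphertext operation.

```python
import numpy as np

def digit_decompose(index, base, num_digits):
    """Return the base-`base` digits of `index`, least significant first."""
    digits = []
    for _ in range(num_digits):
        digits.append(index % base)
        index //= base
    return digits

def one_hot(digit, base):
    """One-hot vector of length `base` with a 1 at position `digit`."""
    v = np.zeros(base, dtype=np.int64)
    v[digit] = 1
    return v

def expand_selector(digit_vectors):
    """Combine per-digit one-hot vectors (least significant first) into
    a single one-hot selector of length base ** num_digits."""
    selector = np.ones(1, dtype=np.int64)
    for v in digit_vectors:
        # Flattened position becomes digit * len(selector) + old position.
        selector = np.outer(v, selector).reshape(-1)
    return selector

def lookup(table, index, base, num_digits):
    """Plaintext stand-in for an encrypted lookup: the client would send
    the per-digit one-hot vectors; the server expands and multiplies."""
    digits = digit_decompose(index, base, num_digits)        # client side
    digit_vectors = [one_hot(d, base) for d in digits]       # client side
    selector = expand_selector(digit_vectors)                # server side
    return selector @ table                                  # server side

# Hypothetical table with 16 rows and embedding dimension 3.
table = np.arange(16 * 3).reshape(16, 3)
row = lookup(table, index=11, base=4, num_digits=2)
```

With `base=4` and `num_digits=2`, the client in this sketch transmits 8 selector entries rather than 16; the gap widens quickly for larger tables, at the cost of extra multiplications on the server.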