Private information retrieval (PIR) lets a client fetch a database record without revealing which record was requested, but it is hindered by intense server-side computation and memory traffic. Modern lattice-based PIR protocols typically involve three phases: ExpandQuery (expanding a query into encrypted indices), RowSel (encrypted row selection), and ColTor (a recursive "column tournament" for final selection). ExpandQuery and ColTor primarily perform number-theoretic transforms (NTTs), whereas RowSel reduces to large-scale independent matrix-matrix multiplications (GEMMs). GPUs are in principle well suited to these tasks, provided multi-client batching is used to achieve high throughput. However, batching fundamentally reshapes the performance bottlenecks: while it amortizes database access costs, it expands working sets beyond the L2 cache capacity, causing divergent memory behaviors and excessive DRAM traffic. We present GPIR, a GPU-accelerated PIR system that rethinks kernel design, data layout, and execution scheduling. We introduce a stage-aware hybrid execution model that dynamically switches between operation-level kernels, which execute each primitive operation separately, and stage-level kernels, which fuse all operations within a protocol stage into a single kernel to maximize on-chip data reuse. For RowSel, we identify a performance gap caused by a structural mismatch between NTT-driven data layouts and tiled GEMM access patterns, which is exacerbated by multi-client batching. We resolve this mismatch through a transposed-layout GEMM design and fine-grained pipelining. Finally, we extend GPIR to multi-GPU systems, scaling both query throughput and database capacity with negligible communication overhead. GPIR achieves up to 305.7x higher throughput than PIRonGPU, the state-of-the-art GPU implementation.
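To make the claim that RowSel reduces to GEMMs and that batching amortizes database access more concrete, the following is a minimal plaintext sketch, not GPIR's implementation. It assumes a toy modulus `P`, small list-based matrices, and one-hot selection vectors in place of the encrypted indices a real protocol would use; the names `row_select_batched` and `one_hot` are hypothetical. Each database row is read once and reused for every client in the batch, which is the access-cost amortization described above.

```python
# Toy plaintext stand-in for RowSel. Real PIR operates on ciphertexts;
# the modulus, shapes, and one-hot selectors here are illustrative assumptions.
P = 65537  # hypothetical small prime modulus


def row_select_batched(selectors, database, modulus=P):
    """Compute (selectors @ database) mod modulus for a batch of clients.

    selectors: B x R list of lists (one selection vector per client)
    database:  R x C list of lists
    returns:   B x C list of lists
    """
    num_cols = len(database[0])
    out = [[0] * num_cols for _ in selectors]
    # Row-outer loop: each database row is read once and reused by every client.
    for r, db_row in enumerate(database):
        for b, sel in enumerate(selectors):
            s = sel[r]  # no skipping on zero: a server cannot see encrypted values
            acc = out[b]
            for c in range(num_cols):
                acc[c] = (acc[c] + s * db_row[c]) % modulus
    return out


def one_hot(index, length):
    """Plaintext selection vector: 1 at `index`, 0 elsewhere."""
    return [1 if i == index else 0 for i in range(length)]


# 8 x 4 toy database and three clients (two of them want the same row).
database = [[(7 * r + c) % P for c in range(4)] for r in range(8)]
wanted = [5, 0, 5]
selectors = [one_hot(i, len(database)) for i in wanted]
results = row_select_batched(selectors, database)
```

In this sketch, `results[b]` equals the database row requested by client `b`. The output grows with the batch size while the database is traversed only once, which hints at why larger batches enlarge the working set even as they amortize database reads.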