Private information retrieval (PIR) allows private database queries; however, it is hindered by intense server-side computation and memory traffic. Numerous modern lattice-based PIR protocols consist of three phases: ExpandQuery (expanding a query into encrypted indices), RowSel (encrypted row selection), and ColTor (recursive "column tournament" for final selection). ExpandQuery and ColTor primarily perform number-theoretic transforms (NTTs), whereas RowSel reduces to large-scale independent matrix-matrix multiplications (GEMMs). GPUs are well suited for these tasks when combined with multi-client batching, which is necessary for high throughput. However, batching fundamentally reshapes the performance bottlenecks: while it amortizes database access costs, it expands working sets beyond the L2 cache capacity, causing divergent memory access behavior and excessive DRAM traffic. We present GPIR, a GPU-accelerated PIR system that rethinks kernel design, data layout, and execution scheduling. We introduce a stage-aware hybrid execution model that dynamically switches between operation-level kernels, which execute each primitive operation separately, and stage-level kernels, which fuse all operations within a stage into a single kernel to maximize on-chip data reuse. For RowSel, we resolve the mismatch between NTT-driven layouts and tiled GEMMs using a transposed-layout design with fine-grained pipelining. We further extend GPIR to multi-GPU systems, scaling throughput and database capacity with negligible communication overhead. GPIR achieves up to 297.2x higher throughput than PIRonGPU, the state-of-the-art GPU implementation.
翻译:暂无翻译