Digital signatures are fundamental building blocks that provide integrity and authenticity in a wide range of protocols. The development of quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography, and the National Institute of Standards and Technology (NIST) has selected it as the primary algorithm for standardization. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and I/O latency, address bank conflicts, and mitigate pipeline stalls. As a result, each operation achieves high and balanced compute and memory throughput. For concurrent task processing, we use task-level batching to fully exploit parallelism and implement a memory pool mechanism for rapid memory access. Because the varying number of repetitions in Dilithium affects both overall execution time and hardware utilization, we propose a dynamic task scheduling mechanism that improves multiprocessor occupancy and significantly reduces execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latency and make full use of the computing capabilities of both the CPU and the GPU. Across all three security levels, our GPU implementation can process ten thousand tasks concurrently in less than 32 milliseconds for signing and less than 15 milliseconds for verification on both commercial and server-grade GPUs. This yields microsecond-level amortized execution times per task, offering a high-throughput, quantum-resistant solution suitable for a wide array of applications in real systems.
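The microsecond-level claim follows directly from the batch figures above. As a rough worked calculation (the symbols $t_{\text{sign}}$, $t_{\text{verify}}$, $R_{\text{sign}}$, and $R_{\text{verify}}$ are introduced here for illustration only, and the values are bounds derived from the stated "less than" batch times rather than measured per-task results):

```latex
% Amortized per-task time: batch time divided by the batch size of 10^4 tasks.
\[
t_{\text{sign}} < \frac{32\ \text{ms}}{10^{4}} = 3.2\ \mu\text{s},
\qquad
t_{\text{verify}} < \frac{15\ \text{ms}}{10^{4}} = 1.5\ \mu\text{s}
\]
% Equivalent lower bounds on throughput: batch size divided by batch time.
\[
R_{\text{sign}} > \frac{10^{4}}{32\ \text{ms}} = 3.125 \times 10^{5}\ \text{signatures/s},
\qquad
R_{\text{verify}} > \frac{10^{4}}{15\ \text{ms}} \approx 6.67 \times 10^{5}\ \text{verifications/s}
\]
```

These are amortized figures for a full batch; the latency of any single task within the batch is not implied by them.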