Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices

This paper presents a further improvement to Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms without SIMD extensions: ARM Cortex-M3 and RISC-V. Specifically, we further enlarge the input range of Plantard arithmetic without modifying its computation steps. After tailoring Plantard arithmetic to Kyber's modulus, we show that the input range of Plantard multiplication by a constant is at least 2.45 times larger than in the original design from TCHES2022. We then present two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V, and show that Plantard arithmetic outperforms both Montgomery and Barrett arithmetic on low-end 32-bit platforms. Building on the enlarged input range and the efficient implementation of Plantard arithmetic on these platforms, we propose several optimization strategies for NTT/INTT. The larger input range lets us minimize, or entirely eliminate, the modular reduction of coefficients in NTT/INTT. Furthermore, we propose two memory optimization strategies that reduce the stack usage of the speed-version Kyber implementation by 23.50% to 28.31% compared to its counterpart on Cortex-M4, which makes the speed-version implementation more feasible on low-end IoT devices. With these optimizations, our NTT/INTT implementation shows considerable speedups over the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
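To make the arithmetic referred to above concrete, the following is a minimal Python sketch of word-size Plantard reduction and of Plantard multiplication by a precomputed constant for Kyber's modulus q = 3329. It assumes a word size l = 16 and a slack parameter alpha = 3, which is our reading of the TCHES2022 formulation, and it does not model the enlarged input range or the Cortex-M3/RISC-V instruction sequences proposed in this paper. All function and constant names are hypothetical and chosen for illustration only.

```python
# Illustrative model of word-size Plantard arithmetic for q = 3329.
# Assumptions (not taken from the abstract): l = 16, alpha = 3,
# q' = q^{-1} mod 2^(2l). Requires Python 3.8+ for pow(x, -1, m).

Q = 3329                      # Kyber modulus
L = 16                        # word size l
ALPHA = 3                     # slack parameter; needs q < 2^(l - alpha - 1)
MOD = 1 << (2 * L)            # 2^(2l) = 2^32
QINV = pow(Q, -1, MOD)        # q^{-1} mod 2^32


def to_signed32(x):
    """Interpret x mod 2^32 as a signed 32-bit integer."""
    x &= MOD - 1
    return x - MOD if x >= (MOD >> 1) else x


def plantard_reduce(x):
    """Return r with r = x * (-2^-32) (mod q).

    Intended for x = a * b with |a|, |b| <= q * 2^alpha.
    """
    p = to_signed32(x * QINV)         # [x * q'] mod 2^32, signed
    p_hi = p >> L                     # upper 16 bits (floor shift)
    return ((p_hi + (1 << ALPHA)) * Q) >> L


def precompute_constant(b):
    """Map b to (b * (-2^32) mod q) * q' mod 2^32, done once offline."""
    b_plant = (b * -MOD) % Q          # b in the Plantard domain
    return (b_plant * QINV) & (MOD - 1)


def plantard_mul_const(a, b_qinv):
    """Return r with r = a * b (mod q), given b_qinv = precompute_constant(b)."""
    p = to_signed32(a * b_qinv)       # one 32-bit multiplication
    p_hi = p >> L
    return ((p_hi + (1 << ALPHA)) * Q) >> L
```

In this sketch, multiplying by a fixed constant such as an NTT twiddle factor costs two multiplications at run time, because the product of the constant and q' is folded into a single precomputed 32-bit word. For example, `plantard_mul_const(a, precompute_constant(17))` returns a small signed representative congruent to `17 * a` modulo q, provided `a` lies within the bound stated above.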
