Yet another Improvement of Plantard Arithmetic for Faster Kyber on Low-end 32-bit IoT Devices

This paper presents a further improvement of Plantard arithmetic that speeds up Kyber implementations on two low-end 32-bit IoT platforms without SIMD extensions (ARM Cortex-M3 and RISC-V). Specifically, we further enlarge the input range of the Plantard arithmetic without modifying its computation steps. After tailoring the Plantard arithmetic to Kyber's modulus, we show that the input range of the Plantard multiplication by a constant is at least 2.45 times larger than in the original design in TCHES2022. We then present two optimization techniques for efficient Plantard arithmetic on Cortex-M3 and RISC-V, and show that the Plantard arithmetic outperforms both Montgomery and Barrett arithmetic on low-end 32-bit platforms. Building on the enlarged input range and the efficient implementation of the Plantard arithmetic on these platforms, we propose several optimization strategies for NTT/INTT: the larger input range lets us minimize or entirely eliminate the modular reduction of coefficients in NTT/INTT. Furthermore, we propose two memory optimization strategies that reduce the stack usage of the speed-version Kyber implementation by 23.50% to 28.31% compared to its counterpart on Cortex-M4. These optimizations make the speed-version implementation more feasible on low-end IoT devices. As a result, our NTT/INTT implementation shows considerable speedups compared to the state-of-the-art work. Overall, we demonstrate the applicability of the speed-version Kyber implementation on memory-constrained IoT platforms and set new speed records for Kyber on these platforms.
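
To make the arithmetic referred to above concrete, the following is a minimal Python sketch of signed Plantard multiplication in the style of the TCHES2022 design. It is not the enlarged-range variant proposed in this paper. The word size l = 16, the slack parameter α = 3, and all function names are assumptions chosen to fit Kyber's modulus q = 3329; none of them is stated in this abstract.

```python
# Sketch of signed Plantard multiplication (assumed parameters, see lead-in).
L = 16                              # assumed word size l; products live mod 2^(2l)
ALPHA = 3                           # assumed slack; needs Q < 2**(L - ALPHA - 1)
Q = 3329                            # Kyber modulus
MASK = (1 << (2 * L)) - 1
QINV = pow(Q, -1, 1 << (2 * L))     # q' = q^{-1} mod 2^(2l); needs Python 3.8+


def signed_2l(x):
    """Reduce x modulo 2^(2l) into the signed range [-2^(2l-1), 2^(2l-1))."""
    x &= MASK
    return x - (1 << (2 * L)) if x >> (2 * L - 1) else x


def plantard_mul(a, b):
    """Return r with r = a*b*(-2^(-2l)) mod q, for |a|, |b| <= q * 2^ALPHA."""
    p = signed_2l(a * b * QINV)     # low 2l bits of a*b*q', taken as signed
    # Keep the upper half, add 2^ALPHA, multiply by q, keep the upper half.
    return (((p >> L) + (1 << ALPHA)) * Q) >> L


def precompute_constant(b):
    """Fold q' into a constant operand (for example a twiddle factor) once."""
    return signed_2l(b * QINV)


def plantard_mul_const(a, b_qinv):
    """Multiply by a constant that was prepared with precompute_constant."""
    p = signed_2l(a * b_qinv)
    return (((p >> L) + (1 << ALPHA)) * Q) >> L
```

On a 32-bit target the reduction modulo 2^(2l) costs nothing extra, because a wrapping 32-bit multiply already discards the upper bits. The constant variant saves one multiplication per call, which is why multiplication by a precomputed constant is the case that matters for NTT/INTT butterflies.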