Low-Stack HAETAE for Memory-Constrained Microcontrollers

We present a low-stack implementation of the module-lattice signature scheme HAETAE, targeting microcontrollers with 8–16 kB of available SRAM. On such devices, peak stack usage is often the binding constraint, and HAETAE's hyperball-based sampler, large transient polynomial vectors, and variable-length signature payloads (hint and high-bits arrays) pose a particular challenge. To address this, we introduce (i) rejection-aware pass decomposition, which isolates encoding to the post-acceptance path; (ii) component-level early rejection, which short-circuits the response computation when a partial norm already exceeds the bound; and (iii) reverse-order streaming entropy coding using range Asymmetric Numeral Systems (rANS), which eliminates full hint and high-bits staging buffers. Combined with streamed matrix generation, a two-pass hyperball sampler with a streaming Gaussian backend, and row-streamed verification, these techniques bring signing stack usage from 71–141 kB in the reference implementation down to 5.8–6.0 kB, key generation to 4.7–5.7 kB, and verification to 4.7–4.8 kB across all three security levels. Our pure C implementation covers all three security levels (HAETAE-2/3/5), whose optimization paths differ due to the public-key domain (d>0 vs. d=0) and rejection structure. We evaluate our optimizations on a Nucleo-L4R5ZI board and compare them to the pqm4 reference implementation (for HAETAE-2 and -3) and a recently published memory-optimized implementation (targeting HAETAE-5 only). We reduce stack usage for HAETAE-2, -3, and -5 by 75 %, 86 %, and 8 % for key generation, by 92 %, 95 %, and 24 % for signature generation, and by 85 %, 91 %, and 22 % for verification, respectively. Depending on the parameter set, this slows key generation by at most a factor of 1.8 and signature generation by at most a factor of 3.4, while improving verification performance by up to 18 %.
Verification at all security levels fits within 8 kB of RAM (signature buffer plus stack) and is 2.34x to 3.34x faster than ML-DSA m4fstack at each comparable security level. We additionally validate portability under RIOT-OS on ARM Cortex-M4 and RISC-V targets.
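
To make the idea behind component-level early rejection concrete, the following is a minimal sketch in C, not the code of the implementation described here. It assumes the response vector can be produced one polynomial at a time through a callback, accumulates the squared Euclidean norm in a single reusable buffer, and stops as soon as the partial sum exceeds the bound. All names (`POLY_N`, `component_fn`, `norm_check_early_reject`, `fill_constant`) are hypothetical, the ring degree of 256 is used only for illustration, and coefficients are assumed small enough that the 64-bit accumulator cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative ring degree; an assumption of this sketch. */
#define POLY_N 256

/* Hypothetical callback: writes component `index` of the candidate
 * response into out[0..POLY_N-1]. */
typedef void (*component_fn)(size_t index, int32_t out[POLY_N], void *ctx);

/* Streams the components one by one through a single stack buffer and
 * accumulates the squared l2 norm. Returns 1 if the total stays within
 * bound_sq, and 0 as soon as a partial sum exceeds it. The number of
 * components actually generated is stored in *components_done. */
int norm_check_early_reject(component_fn gen, void *ctx,
                            size_t n_components, uint64_t bound_sq,
                            size_t *components_done)
{
    int32_t buf[POLY_N];   /* one polynomial, reused for every component */
    uint64_t acc = 0;

    assert(gen != NULL && components_done != NULL);
    *components_done = 0;

    for (size_t i = 0; i < n_components; i++) {
        gen(i, buf, ctx);
        for (size_t j = 0; j < POLY_N; j++) {
            int64_t c = buf[j];
            acc += (uint64_t)(c * c);   /* assumes small coefficients */
        }
        *components_done = i + 1;
        if (acc > bound_sq) {
            return 0;   /* partial norm already too large: reject now */
        }
    }
    return 1;
}

/* Demo generator (hypothetical): every coefficient equals *(int32_t *)ctx. */
void fill_constant(size_t index, int32_t out[POLY_N], void *ctx)
{
    (void)index;
    int32_t v = *(const int32_t *)ctx;
    for (size_t j = 0; j < POLY_N; j++) {
        out[j] = v;
    }
}
```

With `fill_constant` and a coefficient value of 2, each component contributes 256 · 4 = 1024 to the squared norm, so a bound of 1500 is exceeded after the second of four components and the remaining two are never generated. The stack cost of the check is one polynomial buffer regardless of the vector dimension, which is the property the technique relies on.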
