Implementation and Optimization of HQC Decoding on NPU-Integrated Devices

Hamming Quasi-Cyclic (HQC) has been selected by NIST for standardization as an additional code-based key-encapsulation mechanism, providing algorithmic diversity alongside lattice-based post-quantum cryptography. Efficient deployment of HQC on mobile and embedded platforms, however, requires careful optimization of its decoding procedure, whose Reed-Muller and Reed-Solomon components dominate the computational cost. This paper studies HQC decoding on Qualcomm Hexagon processors in NPU-integrated devices, focusing on the Hexagon Vector eXtensions (HVX) backend rather than a tensor-inference engine. We observe that HQC decoding naturally exposes vector-structured computation, including Reed-Muller reliability vectors, Hadamard-transform coefficients, Reed-Solomon syndrome vectors, finite-field products, and packed support-point evaluations. Based on this observation, we redesign the dominant decoding kernels around HVX-friendly data layouts and execution patterns, including a vectorized Reed-Muller Hadamard transform, scalar-equivalent peak selection, HVX-oriented finite-field arithmetic, vectorized syndrome computation, and shortened-support locator-root evaluation. We implement and evaluate the optimized decoder using both Hexagon simulator measurements and real-device experiments on a Snapdragon 8 Gen 2 hardware development kit. The results show that Hexagon/HVX-assisted decoding substantially reduces latency and energy consumption, improving energy efficiency by up to $18.13\times$ while significantly offloading host CPU work. These results indicate that NPU-integrated mobile platforms can serve as effective backends for structured post-quantum cryptographic decoding when the underlying kernels are reformulated around vector execution.
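To make the Reed-Muller part of the pipeline concrete, the following is a minimal scalar sketch of Hadamard-transform decoding with peak selection for a first-order Reed-Muller code RM(1, 7). It is not the paper's HVX implementation. The function names (`fwht`, `rm1_encode`, `rm1_decode`), the bit layout of the message, and the first-index tie-break rule are assumptions made for illustration only. The butterfly loop in `fwht` is the part that maps naturally onto vector lanes, since every pass applies the same add/subtract pattern across the whole reliability vector.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform (natural ordering) of a power-of-two-length list."""
    out = list(values)
    n = len(out)
    half = 1
    while half < n:
        # One butterfly pass: combine elements that are `half` apart.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out


def rm1_encode(message, m=7):
    """Encode m+1 bits into a 2**m-bit RM(1, m) codeword.

    Assumed layout (illustrative): bit 0 of `message` is the constant term,
    bits 1..m are the coefficients of the linear terms.
    """
    const = message & 1
    linear = message >> 1
    return [const ^ (bin(linear & x).count("1") & 1) for x in range(1 << m)]


def rm1_decode(reliability):
    """Decode from a reliability vector (positive favours bit 0, negative bit 1)."""
    coeffs = fwht(reliability)
    peak = 0
    # Strict comparison keeps the first index on ties (assumed tie-break rule).
    for idx in range(1, len(coeffs)):
        if abs(coeffs[idx]) > abs(coeffs[peak]):
            peak = idx
    const = 1 if coeffs[peak] < 0 else 0
    return (peak << 1) | const


# Usage: encode one byte, flip ten positions, and decode it again.
message = 0b10110101
reliability = [1 - 2 * bit for bit in rm1_encode(message)]
for pos in range(0, 100, 10):
    reliability[pos] = -reliability[pos]
assert rm1_decode(reliability) == message
```

The peak index recovers the linear coefficients and the sign of the peak recovers the constant term. A vectorized variant has to reproduce the same peak index as this scalar scan whenever two coefficients have equal magnitude, which is the kind of constraint the term "scalar-equivalent peak selection" suggests.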
