Montgomery modular multiplication is widely used in public-key cryptosystems (PKC) and directly affects the efficiency of the higher-level systems built on it. As security requirements increase, the modulus keeps growing, which makes the computation expensive. A high-performance implementation of Montgomery modular multiplication is therefore urgently needed to keep PKC operations efficient. However, existing high-speed implementations still require a large amount of redundant computation to simplify the intermediate result, and support for redundant representations in Montgomery modular multiplication is extremely limited. In this paper, we propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed over multiple iterations. In this variant, the terms of the intermediate result and the quotient in each iteration are computed in different radices, so that the computation of the quotient can be pipelined. Based on the proposed variant, we also design a high-performance hardware architecture for faster operation. In this architecture, the intermediate result of every iteration is represented as three parts, which avoids redundant computations. Finally, to support FPGA-based systems, we design operators based on the underlying FPGA architecture for better area-time performance. Implementation and experimental results show that our method reduces output latency by 38.3\% compared with the fastest existing design on FPGA.
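For context, the classical iterative (word-serial) Montgomery modular multiplication that the abstract refers to can be sketched as follows. This is a minimal illustration of the textbook algorithm, not of DRMMM itself; the function name `montgomery_multiply`, the word size `k`, and the parameter names are assumptions chosen for this example. Note that each iteration's quotient digit `q` depends on the intermediate result `s` just produced, which is the dependency that the proposed variant relaxes by computing the quotient over multiple iterations.

```python
def montgomery_multiply(a, b, n, k=16):
    """Return a * b * R^(-1) mod n, where R = 2^(k * words).

    Textbook radix-2^k iterative Montgomery multiplication
    (illustrative sketch; assumes 0 <= a, b < n and n odd).
    Requires Python 3.8+ for pow(n, -1, r).
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    r = 1 << k                                # radix of one word
    mask = r - 1
    words = (n.bit_length() + k - 1) // k     # number of k-bit words in n
    n_prime = (-pow(n, -1, r)) % r            # -n^(-1) mod 2^k

    s = 0                                     # intermediate result
    for i in range(words):
        a_i = (a >> (k * i)) & mask           # i-th word of a
        s += a_i * b
        q = ((s & mask) * n_prime) & mask     # quotient digit for this iteration
        s = (s + q * n) >> k                  # low k bits are zero, so the shift is exact
    if s >= n:                                # final conditional subtraction
        s -= n
    return s
```

As a usage note, with `n = 101` and `k = 4` the modulus occupies two words, so `R = 256`, and `montgomery_multiply(45, 76, 101, k=4)` agrees with `(45 * 76 * pow(256, -1, 101)) % 101`.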