Hardware development relies on simulation, particularly cycle-accurate RTL (Register Transfer Level) simulation, which consumes significant time. As single-processor performance grows only slowly, conventional single-threaded RTL simulation is becoming less practical for increasingly complex chips and systems. One solution is parallel RTL simulation, where, ideally, simulators could run on thousands of parallel cores. However, existing simulators can exploit only tens of cores. This paper studies the challenges inherent in running parallel RTL simulation on a multi-thousand-core machine (the Graphcore IPU, a 1472-core machine). Simulation performance requires balancing three factors: synchronization, communication, and computation. We experimentally evaluate each factor and analyze how it affects parallel simulation speed, drawing on contrasts between the large-scale IPU and smaller but faster x86 systems. Using this analysis, we build Parendi, an RTL simulator for the IPU. It distributes RTL simulation across 5888 cores on 4 IPU sockets. Parendi runs large RTL designs up to 4x faster than a powerful, state-of-the-art x86 multicore system.