Déjà Vu Packing: Optimizing FPGA Logic Clustering Runtime via Pattern Memoization

Implementing a digital circuit on an FPGA fabric requires clustering technology-mapped netlist primitives into coarser-granularity blocks that can be directly mapped to the physical resources available on the FPGA. As the architecture of FPGA logic blocks (LBs) has grown in complexity, with sophisticated logic elements (LEs) and highly irregular local interconnect, this packing problem has become more challenging. To ensure the feasibility of intracluster routing, the computer-aided design (CAD) tools must solve a costly multi-source multi-sink routing problem for each candidate cluster. In this paper, we first show that such packing legality checks consume a significant portion of the CAD flow runtime for LB architectures with complex LEs and local routing structures resembling modern commercial FPGAs. We demonstrate that the packing stage constitutes 58% and 94% of the entire Versatile Place and Route (VPR) flow runtime on average when mapping a wide variety of benchmarks to the AMD 7-series-like and Altera Stratix-10-like VTR architecture captures, respectively. By analyzing the packing algorithm behavior, we observe that a significant fraction of the attempted packed clusters are repetitions of a much smaller number of packing patterns, and therefore many of the packing legality checks are redundant and could be skipped. To this end, we introduce our Déjà Vu packing approach, which leverages a novel packing signature tree data structure that enables efficient identification of recurring packing patterns and memoization of their legality check outcomes. Our approach speeds up packing by up to 13.4x and 29.3x, with averages of 3.7x and 6.9x, across the evaluated benchmarks on the 7-series-like and Stratix-10-like architectures, respectively. These packing runtime gains translate into average end-to-end VPR runtime reductions of 1.6x and 5.3x, respectively, while maintaining quality of results.
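The memoization idea can be illustrated with a minimal sketch. The abstract does not specify how a packing signature is built or how the signature tree is organized, so everything below is an assumption made for illustration: the `SignatureNode` and `SignatureTree` names, the trie layout, the use of primitive types as signature keys, and the toy legality check are all hypothetical and are not taken from the paper or from VPR. The sketch only shows the general pattern of caching an expensive legality check behind a tree keyed by a pattern signature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple


@dataclass
class SignatureNode:
    """One node of a hypothetical signature tree (a plain trie)."""
    children: Dict[Hashable, "SignatureNode"] = field(default_factory=dict)
    legal: Optional[bool] = None  # None means "not checked yet"


class SignatureTree:
    """Memoizes legality-check outcomes per packing pattern signature.

    Assumption: the signature captures everything the legality check
    depends on, so two clusters with equal signatures share an outcome.
    """

    def __init__(self, legality_check: Callable[[Sequence], bool]):
        self.root = SignatureNode()
        self.legality_check = legality_check
        self.hits = 0
        self.misses = 0

    def is_legal(self, signature: Sequence[Hashable], cluster: Sequence) -> bool:
        # Walk (and extend) the trie along the signature keys.
        node = self.root
        for key in signature:
            node = node.children.setdefault(key, SignatureNode())
        if node.legal is None:
            # First time this pattern is seen: run the expensive check once.
            self.misses += 1
            node.legal = self.legality_check(cluster)
        else:
            # Recurring pattern: reuse the stored outcome.
            self.hits += 1
        return node.legal


# Toy stand-in for the costly intracluster routing check (hypothetical rule).
calls: List[Sequence] = []


def toy_check(cluster: Sequence[Tuple[str, str]]) -> bool:
    calls.append(cluster)
    return len(cluster) <= 3


def signature_of(cluster: Sequence[Tuple[str, str]]) -> List[str]:
    # Hypothetical signature: the ordered primitive types, ignoring net names.
    return [prim_type for prim_type, _name in cluster]


tree = SignatureTree(toy_check)
a = [("lut", "n1"), ("ff", "n2")]
b = [("lut", "n7"), ("ff", "n9")]  # same pattern as `a`, different primitives
first = tree.is_legal(signature_of(a), a)
second = tree.is_legal(signature_of(b), b)
```

In this toy run the second query reuses the stored outcome, so the stand-in check is called only once. A real signature would have to encode whatever the intracluster router actually depends on; otherwise a cached outcome could be wrong for a cluster that merely looks similar.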


FPGA: ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Publisher: ACM. Site: http://dblp.uni-trier.de/db/conf/fpga/
