Kratos: An FPGA Benchmark for Unrolled DNNs with Fine-Grained Sparsity and Mixed Precision

FPGAs offer a flexible platform for accelerating deep neural network (DNN) inference, particularly for non-uniform workloads featuring fine-grained unstructured sparsity and mixed arithmetic precision. To leverage these redundancies, an emerging approach involves partially or fully unrolling the computations of each DNN layer. This way, parameter-level and bit-level ineffectual operations can be skipped completely, saving the associated area and power. However, unrolled implementations scale poorly, which limits the size of a DNN that can be unrolled on an FPGA. This motivates the investigation of new reconfigurable architectures that improve the efficiency of unrolled DNNs while taking advantage of sparsity and mixed precision. To enable this investigation, we present Kratos: a focused FPGA benchmark of unrolled DNN primitives with varying levels of sparsity and different arithmetic precisions. Our analysis reveals that unrolled DNNs can operate at very high frequencies, reaching the maximum frequency limit of an Arria 10 device. Additionally, we find that substantial area reductions can be achieved through fine-grained sparsity and low bit-widths. We build on these results to tailor the FPGA fabric for unrolled DNNs through an architectural case study demonstrating $\sim$2$\times$ area reduction when using smaller LUT sizes within current FPGAs. This paves the way for further exploration of new programmable architectures that are purpose-built for sparse and low-precision unrolled DNNs. Our source code and benchmark are available at github.com/abdelfattah-lab/Kratos-benchmark.
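
To illustrate why unrolling lets ineffectual operations be skipped, the following is a minimal software sketch, not the Kratos hardware generator. It assumes the weights are compile-time integer constants, so zero weights generate no multiplier at all, and it uses the number of set bits in each remaining weight as a rough proxy for bit-level work (one shift-and-add term per set bit). All function names are hypothetical and chosen for illustration only.

```python
def unrolled_ops(weights):
    """List the (row, col, weight) multiply terms that survive unrolling.

    Zero weights are dropped, mimicking parameter-level skipping.
    """
    return [
        (r, c, w)
        for r, row in enumerate(weights)
        for c, w in enumerate(row)
        if w != 0
    ]


def unrolled_matvec(weights, x):
    """Matrix-vector product computed only from the surviving terms."""
    out = [0] * len(weights)
    for r, c, w in unrolled_ops(weights):
        out[r] += w * x[c]
    return out


def shift_add_terms(weights):
    """Assumed proxy for bit-level cost: one term per set bit of |w|."""
    return sum(bin(abs(w)).count("1") for _, _, w in unrolled_ops(weights))


# Hypothetical 3x4 integer weight matrix with 75% of its entries zero.
weights = [
    [0, 3, 0, -2],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
x = [1, 2, 3, 4]

dense_multiplies = sum(len(row) for row in weights)
sparse_multiplies = len(unrolled_ops(weights))
result = unrolled_matvec(weights, x)
bit_terms = shift_add_terms(weights)
```

In this toy case only 3 of the 12 multiplications remain after unrolling, and the result matches the dense product. Real area savings depend on how the synthesis tool maps the constant multipliers to LUTs, which this sketch does not model.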
