Dataflow architectures are growing in popularity because of their potential to mitigate the challenges posed by the memory wall inherent to the von Neumann architecture. At the same time, high-level synthesis (HLS) has proven to be an effective design methodology for generating efficient dataflow architectures within a short development cycle. However, existing HLS tools rely on developers to explore the vast dataflow design space, which ultimately leads to suboptimal designs. This problem becomes more severe as HLS designs grow in size. To tackle these challenges, we introduce HIDA, a new scalable and hierarchical HLS framework that can systematically convert an algorithmic description into a dataflow implementation on hardware. We first propose a collection of efficient and versatile dataflow representations for modeling the hierarchical dataflow structure. Building on these representations, we develop an automated optimizer that decomposes the dataflow optimization problem into multiple levels based on the inherent dataflow hierarchy. Evaluated on FPGAs with a set of neural networks modeled in PyTorch, HIDA achieves up to 8.54$\times$ higher throughput than the state-of-the-art (SOTA) HLS optimization tool. Furthermore, despite being fully automated and able to handle various applications, HIDA achieves 1.29$\times$ higher throughput than the SOTA RTL-based neural network accelerators on an FPGA.
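To make the idea of optimizing along a dataflow hierarchy more concrete, the following is a minimal, generic sketch and is not HIDA's actual algorithm. It assumes a toy model in which each leaf kernel's initiation interval is its work divided by a parallel factor, each dataflow region runs its children concurrently (so the region is only as fast as its slowest child), and every unit of parallelism costs one unit of resources. The `Node` class, the cost model, and the greedy `balance` routine are all hypothetical illustrations: the optimizer walks down the hierarchy to the bottleneck leaf and doubles its parallelism until a resource budget is exhausted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical node in a hierarchical dataflow graph (illustration only).

    Leaves model a single kernel; non-leaf nodes model a dataflow region whose
    children execute concurrently as pipeline stages.
    """
    name: str
    work: int = 0          # leaf only: cycles per invocation at parallel factor 1
    parallel: int = 1      # leaf only: current parallel (unroll) factor
    children: List["Node"] = field(default_factory=list)

    def interval(self) -> int:
        if not self.children:
            # Ceiling division: cycles per invocation at the current parallel factor.
            return -(-self.work // self.parallel)
        # Assumed model: a region accepts new input only as fast as its slowest child.
        return max(c.interval() for c in self.children)

    def cost(self) -> int:
        # Assumed resource model: one resource unit per unit of parallelism.
        if not self.children:
            return self.parallel
        return sum(c.cost() for c in self.children)


def bottleneck(node: Node) -> Node:
    """Descend level by level, always following the slowest child, to the limiting leaf."""
    while node.children:
        node = max(node.children, key=lambda c: c.interval())  # ties go to the first child
    return node


def balance(root: Node, budget: int) -> None:
    """Greedily double the bottleneck leaf's parallel factor while the budget allows."""
    while True:
        leaf = bottleneck(root)
        extra = leaf.parallel  # doubling adds `parallel` units under the assumed cost model
        if root.cost() + extra > budget or leaf.parallel >= leaf.work:
            return
        leaf.parallel *= 2


# Usage: a two-level hierarchy (a "backbone" region nested inside the top-level model).
conv1 = Node("conv1", work=64)
conv2 = Node("conv2", work=256)
fc = Node("fc", work=32)
root = Node("model", children=[Node("backbone", children=[conv1, conv2]), fc])
balance(root, budget=16)
print(root.interval(), root.cost())  # 32 13
```

Because each region's interval depends only on its children, the search for the bottleneck stays local to one path through the hierarchy rather than scanning a flattened graph; a real framework would use far richer latency and resource models than this toy one.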