FPGAs are well suited to dataflow architectures that process data in a streaming or pipelined manner, which lets them meet the high computational and communication demands of emerging applications. However, manually implementing an efficient dataflow architecture for large-scale applications remains challenging, even for specialists who use high-level synthesis (HLS) to simplify FPGA programming. To address this, we introduce CODO, an automated compiler that generates feasible and efficient dataflow accelerators on FPGAs. CODO features a systematic method for detecting and eliminating both coarse-grained and fine-grained dataflow violations. Building on this, CODO optimizes both on-chip and off-chip data movement to maximize transfer efficiency. To further improve design quality, CODO performs automatic scheduling to generate high-performance dataflow accelerators with a balanced performance-resource trade-off. Synthesis results show that CODO delivers $1.45\times$ to $4.52\times$ latency speedups on typical computation kernels and $3.7\times$ to $33.8\times$ speedups on DNN models compared to state-of-the-art (SOTA) frameworks. In on-board evaluations, CODO achieves an average speedup of $7.3\times$ on CNN models and $2.07\times$ on the GPT-2 model over SOTA frameworks. The compiler is open-sourced at https://github.com/sjtu-zhao-lab/codo-artifact.