Automatic multi-dimensional pipelining for high-level synthesis of dataflow accelerators

Demand for edge computing of image processing and machine learning workloads has surged in recent years, renewing interest in custom hardware accelerators that deliver higher performance and better energy efficiency. These workloads frequently exhibit affine memory accesses and constant loop bounds. In this paper, we introduce an automatic scheduler for high-level synthesis that pipelines aggressively to increase parallelism. The scheduler is built on a unified Integer Linear Programming (ILP) formulation that identifies pipelining opportunities along multiple loop and scalar dimensions. Our multi-dimensional pipelining technique encompasses both the inner-loop pipelining and the dataflow optimization of Vitis HLS, and it handles more general memory access patterns than the Vitis HLS dataflow optimization. Our approach also generates statically scheduled circuits, which improves resource efficiency. We integrated the scheduler into HIR, an MLIR-based high-level synthesis compiler framework, and evaluated its performance. Compared to Vitis HLS, our scheduler pipelines more aggressively across multiple producer-consumer loop nests, reducing overall execution latency. On a set of representative benchmarks, the producer-consumer pipelined execution enabled by our scheduler yields an average performance improvement of 2.42X over loop pipelining alone, and an average improvement of 1.30X over Vitis HLS with dataflow optimizations.
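To illustrate why overlapping producer-consumer loop nests reduces latency, the following is a minimal sketch of a toy timing model. It is not the paper's ILP formulation: it replaces the solver with a brute-force scan over all dependences, and the function names, the initiation intervals, the one-cycle write latency, and the three-point stencil access pattern are all assumptions made for illustration. The producer loop writes `A[i]` in iteration `i`; the consumer loop reads a few elements of `A` per iteration. The sketch finds the smallest static start delay for the consumer such that every read occurs after the corresponding write has completed.

```python
def min_consumer_offset(n_cons, reads, ii_prod=1, ii_cons=1, write_latency=1):
    """Smallest start delay (in cycles) for the consumer loop such that
    every read of A[idx] happens no earlier than the completion of its write.

    Assumed toy model: producer iteration idx issues at cycle idx * ii_prod
    and its write completes write_latency cycles later; consumer iteration j
    issues at cycle offset + j * ii_cons.
    """
    offset = 0
    for j in range(n_cons):
        for idx in reads(j):
            write_done = idx * ii_prod + write_latency
            # Dependence constraint: offset + j * ii_cons >= write_done
            offset = max(offset, write_done - j * ii_cons)
    return offset


def sequential_offset(n_prod, ii_prod=1, write_latency=1):
    """Start delay when the consumer waits for the whole producer loop."""
    return (n_prod - 1) * ii_prod + write_latency


def total_latency(n_cons, offset, ii_cons=1):
    """Cycle at which the last consumer iteration has issued."""
    return offset + n_cons * ii_cons


# Hypothetical example: 16 producer iterations, 14 consumer iterations,
# consumer iteration j reads A[j], A[j+1], A[j+2] (affine accesses).
N_PROD, N_CONS = 16, 14
stencil = lambda j: (j, j + 1, j + 2)

overlapped = min_consumer_offset(N_CONS, stencil)
sequential = sequential_offset(N_PROD)

print(total_latency(N_CONS, overlapped))  # 17 cycles with overlap
print(total_latency(N_CONS, sequential))  # 30 cycles without overlap
```

In this toy model the consumer can start three cycles after the producer instead of waiting for all sixteen writes, because both loops have constant bounds and affine accesses and every dependence can therefore be checked at compile time. An ILP formulation would express the same kind of dependence constraints as linear inequalities over the schedule variables rather than enumerating them.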
