Dataflow devices offer a way to reduce the control and data movement overhead of load-store architectures. Various dataflow accelerators have been proposed, but how to schedule applications efficiently on such devices remains an open problem. The programmer can explicitly implement both temporal and spatial parallelism, and pipelining across multiple processing elements can be crucial to exploiting the fast on-chip interconnect, enabling the concurrent execution of different program components. This paper introduces canonical task graphs, a model that enables streaming scheduling of task graphs over dataflow architectures. We show how a task graph can be statically analyzed to determine its steady-state behavior, and we use this information to partition it into temporally multiplexed components of spatially executed tasks. Results on synthetic and realistic workloads show that streaming scheduling can increase speedup and device utilization over a traditional scheduling approach.