Domain-specific accelerators deliver exceptional performance on their target workloads through datapaths orchestrated at fabrication time. However, such specialized architectures often exhibit performance fragility when exposed to new kernels or irregular input patterns. In contrast, programmable architectures like FPGAs, CGRAs, and GPUs rely on compile-time orchestration to support a broader range of applications, but they are typically less efficient under irregular or sparse data. Pushing the boundaries of programmable architectures requires designs that achieve efficiency and performance on par with specialized accelerators while retaining the agility of general-purpose architectures. We introduce Canon, a parallel architecture that bridges the gap between specialized and general-purpose architectures. Canon exploits data-level and instruction-level parallelism through two mechanisms. First, it employs a novel dynamic, data-driven orchestration mechanism using programmable Finite State Machines (FSMs). These FSMs are programmed at compile time to encode the high-level dataflow per state, and at runtime they translate incoming meta-information (e.g., sparse coordinates) into control instructions. Second, Canon introduces time-lapsed SIMD execution, in which instructions are issued across a row of processing elements over several cycles, creating a staggered, pipelined execution. Together, these innovations amortize control overhead and allow dynamic instruction changes while constructing a continuously evolving dataflow that maximizes parallelism. Experimental evaluation shows that Canon delivers high performance across diverse data-agnostic and data-driven kernels and achieves efficiency comparable to specialized accelerators while retaining the flexibility of a general-purpose architecture.
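The two mechanisms named above can be illustrated with a small software model. The sketch below is a toy under stated assumptions, not Canon's actual hardware or instruction set: the state names, the instruction mnemonics (`LOAD_MAC`, `MAC`, `FLUSH_MAC`), the one-cycle-per-PE stagger, and both function names are hypothetical choices made only for illustration. `fsm_translate` shows a table-driven FSM turning a stream of sparse row coordinates into control instructions, and `staggered_schedule` shows one instruction stream reaching successive processing elements one cycle apart.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transition table: (state, event) -> (next_state, instruction).
# In this toy model the table plays the role of the compile-time FSM program.
FSM_TABLE: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("IDLE", "new_row"): ("ACCUM", "LOAD_MAC"),    # first nonzero starts an accumulator
    ("ACCUM", "same_row"): ("ACCUM", "MAC"),       # keep accumulating into the same row
    ("ACCUM", "new_row"): ("ACCUM", "FLUSH_MAC"),  # write out the old row, start a new one
}


def fsm_translate(row_ids: List[int]) -> List[str]:
    """Translate a stream of sparse row coordinates into control instructions."""
    state = "IDLE"
    prev_row: Optional[int] = None
    instructions: List[str] = []
    for row in row_ids:
        # The runtime "meta-information" is reduced to a single event per element.
        event = "same_row" if row == prev_row else "new_row"
        state, instr = FSM_TABLE[(state, event)]
        instructions.append(instr)
        prev_row = row
    return instructions


def staggered_schedule(
    instructions: List[str], num_pes: int
) -> List[List[Optional[str]]]:
    """Return schedule[cycle][pe]: the instruction PE `pe` runs at `cycle`, or None if idle.

    Assumption: instruction i is issued to PE 0 at cycle i and reaches PE p at
    cycle i + p, so the row of PEs forms a staggered pipeline.
    """
    if not instructions or num_pes <= 0:
        return []
    total_cycles = len(instructions) + num_pes - 1
    schedule: List[List[Optional[str]]] = []
    for cycle in range(total_cycles):
        row: List[Optional[str]] = []
        for pe in range(num_pes):
            idx = cycle - pe  # which instruction has reached this PE by now
            row.append(instructions[idx] if 0 <= idx < len(instructions) else None)
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    stream = fsm_translate([0, 0, 2, 2, 2, 5])
    for cycle, row in enumerate(staggered_schedule(stream, num_pes=3)):
        print(cycle, row)
```

In this model a single FSM decision is reused by every PE in the row, which is one way to picture how control overhead is amortized. Because each PE sees the instruction one cycle later than its neighbor, the stream can change from one instruction to the next without stalling the whole row.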