Capstone: Power-Capped Pipelining for Coarse-Grained Reconfigurable Array Compilers

Coarse-grained reconfigurable arrays (CGRAs) have attracted growing interest because they offer performance and energy efficiency competitive with ASICs while retaining flexibility similar to FPGAs. These properties make CGRAs attractive for accelerators and other power-constrained systems. However, modern CGRA compilers pipeline aggressively to improve frequency and performance, and the resulting designs often violate hard power budgets. We show empirically that, in state-of-the-art CGRA compilers such as Cascade, post-place-and-route (post-PnR) pipelining increases power monotonically and ultimately exceeds fixed power caps across diverse workloads. In response, we introduce \emph{Capstone}, a power-aware extension of Cascade that integrates a fast, compiler-resident power model with a user-tunable controller that guides bitstream selection toward optimization targets. Capstone predicts per-iteration power directly inside the post-PnR compilation loop and selects one or a small set of PnR configurations such that at least one meets a user-specified power cap. We thus shift the objective from indiscriminately maximizing frequency to maximizing safe frequency under a discrete power cap. On a suite of kernels spanning fundamental dense and sparse applications, Capstone meets the power cap and minimizes the remaining power headroom while preserving feasible performance. Our results indicate that cap-aware compilation is both necessary and practical: the compiler can proactively land on cap-compliant points and expose predictable performance under power constraints.
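
The selection objective described above (maximize frequency subject to a power cap) can be illustrated with a minimal sketch. This is not Capstone's implementation: the `Candidate` record, the `select_under_cap` helper, the field names, and all numbers are hypothetical and chosen only for illustration. The sketch assumes each post-PnR pipelining iteration yields an achieved frequency and a power value predicted by some compiler-resident model.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One post-PnR pipelining iteration (hypothetical record, not Capstone's)."""
    iteration: int
    freq_mhz: float            # achieved clock frequency after this iteration
    predicted_power_mw: float  # output of an assumed compiler-resident power model


def select_under_cap(candidates: Iterable[Candidate],
                     cap_mw: float) -> Optional[Candidate]:
    """Return the highest-frequency candidate whose predicted power is within the cap.

    Ties on frequency are broken in favor of lower predicted power.
    Returns None when no candidate satisfies the cap.
    """
    feasible = [c for c in candidates if c.predicted_power_mw <= cap_mw]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.freq_mhz, -c.predicted_power_mw))


# Illustrative values only: power rises monotonically with pipelining depth.
candidates = [
    Candidate(0, 300.0, 80.0),
    Candidate(1, 400.0, 95.0),
    Candidate(2, 500.0, 110.0),
    Candidate(3, 600.0, 130.0),
]

cap_mw = 115.0
best = select_under_cap(candidates, cap_mw)
if best is not None:
    headroom_mw = cap_mw - best.predicted_power_mw
    print(best.iteration, best.freq_mhz, headroom_mw)
```

With these made-up values, the fastest candidate (iteration 3) exceeds the cap, so the sketch falls back to the fastest candidate that fits. Because power grows monotonically with pipelining in this example, the chosen point is also the one with the least remaining headroom under the cap.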
