Mestra: Exploring Migration on Virtualized CGRAs

As modern Coarse-Grained Reconfigurable Arrays (CGRAs) grow in size, it becomes increasingly difficult for a single application to use the available fabric efficiently. Existing CGRA mappers either fail to utilize the available fabric or rely on rigid static code transformations with limited adaptability. Multi-tenant CGRAs have emerged as a promising way to increase hardware utilization, but current attempts fail to address key challenges such as fabric fragmentation and live migration. To address this gap, we present Mestra, an end-to-end system for CGRA multi-tenancy that supports dynamic scheduling and resource allocation in a shared environment. Mestra addresses the fabric fragmentation caused by kernels completing out of order by supporting both stateless and stateful live kernel migration as a de-fragmentation mechanism. We assess our solution on an Alveo-U280 data-center-grade FPGA card, reporting area, frequency, and power. Performance is evaluated using routines from the PolyBench benchmark suite and kernels derived from common machine learning operators. Results show that spatial sharing of the available fabric across multiple users improves workload makespan by up to 70.48%, while live kernel migration reduces tail latency on fragmented layouts by up to 29.60%. The custom tightly coupled controller and read-back paths required for virtualization and stateful migration introduce a LUT cost of 0.13% per region. Our evaluation shows that multi-tenancy is important for efficient CGRA utilization, and that live kernel migration can further improve performance by recovering fragmented space at minimal hardware cost.
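To make the fragmentation problem concrete, the following is a minimal toy sketch, not Mestra's actual scheduler or migration mechanism. It assumes a hypothetical one-dimensional fabric of equally sized regions, each either free (`None`) or held by a named kernel, and models migration as simply relocating resident kernels toward one end. The names `largest_free_run` and `compact` are illustrative only.

```python
from typing import List, Optional

Fabric = List[Optional[str]]  # one entry per region; None means free (assumed model)


def largest_free_run(fabric: Fabric) -> int:
    """Return the length of the longest run of contiguous free regions."""
    best = run = 0
    for owner in fabric:
        run = run + 1 if owner is None else 0
        best = max(best, run)
    return best


def compact(fabric: Fabric) -> Fabric:
    """Relocate every resident kernel toward index 0, preserving their order.

    This stands in for migration used as de-fragmentation; it ignores
    migration cost and kernel state entirely.
    """
    resident = [owner for owner in fabric if owner is not None]
    return resident + [None] * (len(fabric) - len(resident))


# Six regions; kernels B, D, and F finished out of order and left holes.
fabric: Fabric = ["A", None, "C", None, "E", None]

# Three regions are free, but no two of them are adjacent.
fragmented_run = largest_free_run(fabric)

# After compaction the same three free regions form one contiguous block.
compacted = compact(fabric)
compacted_run = largest_free_run(compacted)
```

In this toy layout, a newly arriving kernel that needs three adjacent regions cannot be placed before compaction even though three regions are free, and can be placed afterwards. A real system would also have to weigh the cost of moving each kernel, which this sketch deliberately leaves out.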
