Multidimensional loop kernels often suffer from control overhead that can dominate execution time on parallel loop accelerators. Tightly Coupled Processor Arrays (TCPAs) offload loop control to a global controller (GC), but existing approaches still require hundreds of control signals. We propose a method that derives these control conditions from a polyhedral representation of the iteration space and aggressively reduces them, cutting the number of control signals by 15x to 45x across several benchmarks. We introduce a lightweight GC architecture that evaluates conditions as unions of polyhedra using bounded evaluation units and requires hardware comparable to a single processing element. Control signals are distributed throughout the array with a minimal number of delay elements, resulting in zero-overhead loop control. Our evaluation on PolyBench kernels shows that the entire control flow requires less than 10% of the total array resources.
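To make the idea of a control condition expressed as a union of polyhedra concrete, the following is a minimal software sketch, not the GC hardware described above. It assumes each polyhedron is a conjunction of affine constraints of the form a · i + c ≥ 0 over the iteration vector i, and that a control signal is active when the current iteration lies in at least one polyhedron of the union. The names `in_polyhedron`, `control_signal`, and the 4x4 example iteration space are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# One affine constraint: sum(coeffs[k] * point[k]) + const >= 0 (assumed form).
Constraint = Tuple[Sequence[int], int]
# A polyhedron is a conjunction of constraints; a condition is a union of polyhedra.
Polyhedron = List[Constraint]


def in_polyhedron(point: Sequence[int], poly: Polyhedron) -> bool:
    """Return True if the point satisfies every constraint of the polyhedron."""
    return all(
        sum(c * x for c, x in zip(coeffs, point)) + const >= 0
        for coeffs, const in poly
    )


def control_signal(point: Sequence[int], union: List[Polyhedron]) -> bool:
    """Return True if the point lies in at least one polyhedron of the union."""
    return any(in_polyhedron(point, poly) for poly in union)


# Hypothetical example: a 4x4 iteration space (i0, i1) with 0 <= i0, i1 < N.
# The signal should be active on the first row (i0 == 0) or the last column
# (i1 == N - 1), which is a union of two polyhedra.
N = 4
first_row: Polyhedron = [((-1, 0), 0)]        # -i0 >= 0, i.e. i0 <= 0
last_col: Polyhedron = [((0, 1), -(N - 1))]   # i1 - (N - 1) >= 0
boundary = [first_row, last_col]

# Enumerate the iteration space and collect the iterations where the signal fires.
active = [
    (i0, i1)
    for i0 in range(N)
    for i1 in range(N)
    if control_signal((i0, i1), boundary)
]
print(active)
```

In this sketch the cost of evaluating a signal grows with the number of polyhedra and constraints in its union, which gives some intuition for why reducing the set of control conditions matters before mapping them onto a bounded number of evaluation units.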