While GPUs dominate massively parallel computing through the single-instruction, multiple-thread (SIMT) programming model, their underlying single-instruction, multiple-data (SIMD) execution incurs substantial energy overhead from frequent register file (RF) accesses and complex control logic. We present DICE, a novel architecture that addresses these inefficiencies by replacing the SIMD backend with minimal-overhead, statically scheduled coarse-grained reconfigurable arrays (CGRAs). Unlike SIMD units that execute warps of threads in lockstep, DICE dispatches active threads in a pipelined manner onto the CGRA fabric, where data flow directly between processing elements (PEs), reducing RF accesses for intermediate values. To handle operations with runtime dynamism, such as variable-latency memory loads and data-dependent control flow, while preserving static scheduling, DICE compiles programs into "p-graphs" by partitioning dynamic dependence edges across separate CGRA configurations. DICE further introduces several key optimizations: double-buffered configuration memory to hide reconfiguration latency, compile-time p-graph unrolling to enhance resource utilization, and a temporal memory coalescing unit (TMCU) to merge memory requests from consecutive, pipelined threads. Evaluations on Rodinia benchmarks in Accel-sim demonstrate that DICE reduces register file accesses by 68% on average. With equivalent computation and memory resources, DICE's CGRA Processors (CPs) achieve a geometric mean of 1.77-1.90x dynamic energy efficiency and 42.0%-45.9% average power reduction compared to the modeled NVIDIA Turing Streaming Multiprocessors (SMs), while the full DICE system achieves performance comparable to the modeled Turing GPU baselines. DICE demonstrates that spatial pipeline execution can deliver substantial energy savings without sacrificing performance.
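The p-graph idea can be illustrated with a minimal sketch, which is not DICE's actual compiler pass. It assumes a kernel is given as an acyclic dataflow graph and that edges leaving a variable-latency load or a branch are the "dynamic" ones. The op names, the `DYNAMIC_OPS` set, and the function `partition_into_pgraphs` are all hypothetical. Each node gets a partition index such that every dynamic edge crosses into a later partition, while chains of static edges can stay within one statically scheduled configuration.

```python
from collections import defaultdict, deque

# Assumption of this sketch: these op kinds produce results whose timing or
# validity is only known at run time, so their outgoing edges are "dynamic".
DYNAMIC_OPS = {"load", "branch"}


def partition_into_pgraphs(ops, edges):
    """Assign each node of an acyclic dataflow graph a partition index.

    ops:   dict mapping node name -> op kind (hypothetical names)
    edges: list of (src, dst) dependence edges
    Returns a dict mapping node name -> partition index (0, 1, 2, ...).
    """
    succs = defaultdict(list)
    indegree = {node: 0 for node in ops}
    for src, dst in edges:
        succs[src].append(dst)
        indegree[dst] += 1

    stage = {node: 0 for node in ops}
    ready = deque(node for node in ops if indegree[node] == 0)
    visited = 0

    # Kahn's topological traversal: a node's stage is final once it is popped,
    # because all of its predecessors have already been processed.
    while ready:
        node = ready.popleft()
        visited += 1
        # Crossing a dynamic edge forces the consumer into a later partition.
        bump = 1 if ops[node] in DYNAMIC_OPS else 0
        for nxt in succs[node]:
            stage[nxt] = max(stage[nxt], stage[node] + bump)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if visited != len(ops):
        raise ValueError("dependence graph contains a cycle")
    return stage


# Toy kernel body: compute an address, load, multiply, compare, branch, store.
example_ops = {
    "addr": "add", "ld": "load", "mul": "mul",
    "cmp": "cmp", "br": "branch", "st": "store",
}
example_edges = [
    ("addr", "ld"), ("ld", "mul"), ("mul", "cmp"),
    ("cmp", "br"), ("br", "st"), ("mul", "st"),
]
example_stage = partition_into_pgraphs(example_ops, example_edges)
print(example_stage)
```

In this toy graph the load and the branch each end a partition, so the nodes fall into three groups. A value that crosses a partition boundary (for example `mul` feeding `st`) would have to be held somewhere between configurations, which is where register file accesses would remain in such a scheme.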