Efficiency improvements in hardware accelerators, such as single-instruction-multiple-data (SIMD) units and coarse-grained reconfigurable architectures (CGRAs), have enabled the rapid advancement of AI and machine learning applications. These streaming applications consist of numerous vector operations that can be naturally parallelized. Despite the outstanding achievements of today's hardware accelerators, their potential is limited by their instruction set design. Traditional instruction sets, designed for microprocessors and accelerators, focus on computation and pay little attention to instruction composability and instruction-level cooperation. This leads to rigid instruction sets that are difficult to extend and to significant control overhead in hardware. This paper presents a composable instruction set (CIS) that is composable in both the spatial and the temporal sense and is suitable for streaming applications. The proposed instruction set contains significantly fewer instruction types but can still efficiently implement complex multi-level loop structures, which is essential for accelerating streaming applications. It is also a resource-centric instruction set that can be conveniently extended by adding new hardware resources, thus creating a custom heterogeneous computation machine. Besides presenting the composable instruction set, we propose a simple yet efficient instruction scheduling algorithm. We analyzed the scalability of the scheduling algorithm and compared the efficiency of our compiled programs against RISC-V programs. The results indicate that our scheduling algorithm scales linearly and that our instruction set leads to near-optimal execution latency. Applications mapped onto CIS are nearly 10 times faster than their RISC-V versions.