As applications cross-fertilize and model scale keeps increasing, the efficiency and productivity of hardware computing architectures have become inadequate. This inadequacy exacerbates problems of design flexibility, design complexity, development cycle, and development cost (the 4-D problems) across divergent scenarios. To address these challenges, this paper proposes DIAG, a flexible design flow based on plugin techniques. The flow guides hardware development through four layers: definition (D), implementation (I), application (A), and generation (G). Furthermore, a versatile CGRA generator called WindMill is implemented, allowing agile generation of customized hardware accelerators based on specific application demands. Applications and algorithm tasks from three aspects are evaluated experimentally. For a reinforcement learning algorithm, a performance improvement of $2.3\times$ over a GPU is achieved.