Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced technology nodes, together with die sizes approaching the reticle limit, make such systems impractical to build monolithically. With recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerators and the absence of a system- and package-level co-design methodology make it difficult for designers to find the optimal design point in terms of Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework that explores the vast design space of chiplet-based AI accelerators, encompassing resource allocation, placement, and packaging architecture. We analytically model the PPAC of a chiplet-based AI accelerator and integrate the model into an OpenAI Gym environment to evaluate candidate design points. We also explore non-RL-based optimization approaches and combine the two to ensure the robustness of the optimizer. At iso-area, the optimizer-suggested design point achieves 1.52× throughput, 0.27× energy, and 0.01× die cost while incurring only 1.62× the package cost of its monolithic counterpart.