Recent studies have extensively explored NPU architectures for accelerating AI inference in inherently resource-constrained on-device environments. Meanwhile, transformer-based large language models (LLMs) have become dominant; their model sizes are growing rapidly, yet they exhibit a lower degree of parameter reuse than conventional CNNs, making end-to-end execution on resource-limited devices extremely challenging. To address these challenges, we propose TriGen, a novel NPU architecture tailored to resource-constrained environments through software-hardware co-design. First, TriGen adopts low-precision computation based on microscaling (MX), which opens additional optimization opportunities while preserving accuracy, and it resolves the issues that such precision introduces. Second, to jointly optimize nonlinear and linear operations, TriGen replaces specialized hardware for essential nonlinear operations with fast and accurate lookup tables (LUTs), maximizing performance gains and reducing hardware cost in on-device environments. Finally, accounting for practical hardware constraints, TriGen further employs scheduling techniques that maximize compute utilization even under limited on-chip memory capacity. We evaluate TriGen on various LLMs and show that it achieves an average 2.73x speedup and 52% less memory traffic than the baseline NPU design, with negligible accuracy loss.
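To make the microscaling (MX) idea concrete: MX formats group values into small blocks that share a single power-of-two scale, with each element stored at low precision. The sketch below illustrates this blockwise scheme in NumPy under assumed parameters (block size 32, 8-bit integer elements); it is an illustration of the general MX concept, not TriGen's actual datapath.

```python
import numpy as np

def mx_quantize(x, block_size=32, elem_bits=8):
    """Blockwise MX-style quantization: one shared power-of-two scale
    per block of `block_size` elements, elements stored as int8.
    (Illustrative sketch; parameters are assumptions, not TriGen's.)"""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)
    # Shared scale: smallest power of two so the block's max magnitude
    # fits in the element range.
    max_mag = np.max(np.abs(blocks), axis=1, keepdims=True)
    max_mag = np.where(max_mag == 0, 1.0, max_mag)
    qmax = 2 ** (elem_bits - 1) - 1          # e.g. 127 for 8-bit elements
    scale = 2.0 ** np.ceil(np.log2(max_mag / qmax))
    q = np.clip(np.round(blocks / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def mx_dequantize(q, scale, n):
    """Reconstruct the first n values from quantized blocks and scales."""
    return (q.astype(np.float32) * scale).reshape(-1)[:n]
```

Because the per-block scale is a power of two, dequantization in hardware reduces to an exponent shift rather than a multiply, which is one reason MX formats suit low-cost NPU datapaths.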
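The LUT-based approach to nonlinear operations can likewise be sketched briefly: sample the exact function at uniform breakpoints once, then answer inference-time queries by piecewise-linear interpolation. The example below approximates GELU this way; the table size and input range are assumptions for illustration, and the sketch assumes activations fall within the sampled range (values outside it are clamped to the endpoint entries).

```python
import math
import numpy as np

def build_gelu_lut(lo=-8.0, hi=8.0, entries=256):
    """Sample exact GELU at uniform breakpoints for a lookup table.
    (Range and table size are illustrative assumptions.)"""
    xs = np.linspace(lo, hi, entries)
    ys = np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))
                   for v in xs])
    return xs, ys

def lut_gelu(x, xs, ys):
    """Evaluate GELU via piecewise-linear interpolation of the table.
    np.interp clamps inputs outside [xs[0], xs[-1]] to the endpoints."""
    return np.interp(x, xs, ys)
```

A table of a few hundred entries already keeps the interpolation error well below typical low-precision quantization noise, which is why LUT evaluation can replace dedicated function units without a measurable accuracy hit.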