Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90\% of inference time, efficient GEMM acceleration is critical for edge AI. The AI Engines (AIE-ML) available in AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores. This approach fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores and achieves scalability through iterative graph execution combined with algorithmic data tiling and replication in the Programmable Logic (PL). High-speed cascade streaming provides low-latency partial-sum reduction at an Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W of total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we show that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA framework (ARIES). Furthermore, the framework maintains 0.00\% URAM and DSP utilization, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
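For intuition, the sketch below models the temporal-scaling idea in plain C++: a fixed-size tile kernel, standing in for the fixed 16-core AIE-ML block, is re-invoked over the tiles of an arbitrarily large GEMM with partial-sum accumulation, so hardware usage stays constant as the problem grows. The tile shape and the names \texttt{fixed\_block\_matmul} and \texttt{temporal\_gemm} are hypothetical and are not taken from the Tempus implementation.

\begin{lstlisting}[language=C++]
// Illustrative sketch only: temporal GEMM scaling over a fixed compute block,
// modeled in plain C++. Tile sizes and function names are assumptions and do
// not reflect Tempus's actual AIE-ML graph or PL data movers.
#include <cstddef>
#include <vector>

// Tile shape handled by the fixed compute block per invocation (assumed).
constexpr std::size_t TM = 32, TK = 32, TN = 32;

// Stand-in for one pass of the fixed block: C_tile += A_tile * B_tile.
void fixed_block_matmul(const float* A, const float* B, float* C,
                        std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < TM; ++i)
        for (std::size_t k = 0; k < TK; ++k)
            for (std::size_t j = 0; j < TN; ++j)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// Temporal scaling: the same fixed block is invoked iteratively over every
// tile of the larger problem instead of instantiating more hardware.
// Assumes row-major A (MxK), B (KxN), zero-initialized C (MxN), and
// dimensions that are multiples of the tile sizes.
void temporal_gemm(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, std::size_t M, std::size_t K,
                   std::size_t N) {
    for (std::size_t i = 0; i < M; i += TM)
        for (std::size_t j = 0; j < N; j += TN)
            for (std::size_t k = 0; k < K; k += TK)  // partial-sum accumulation
                fixed_block_matmul(&A[i * K + k], &B[k * N + j],
                                   &C[i * N + j], K, N, N);
}
\end{lstlisting}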