Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90\% of inference time, efficient GEMM acceleration is critical for edge AI. The AI Engines (AIE-ML) available in AMD Versal adaptive SoCs are well suited for this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores. This approach fails on resource-limited edge SoCs due to physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than expanding hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores and achieves scalability through iterative graph execution combined with algorithmic data tiling and replication in the Programmable Logic (PL). High-speed cascade streaming provides low-latency partial-sum reduction at an Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W of total on-chip power. By characterizing system-level efficiency through the Platform-Aware Utility (PAU) metric, we show that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA framework (ARIES). Furthermore, the framework maintains 0.00\% URAM and DSP utilization, yielding 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
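For intuition, the sketch below models the temporal-scaling idea in plain C++: a fixed-size tile kernel, standing in for the fixed 16-core AIE-ML block, is re-invoked over the tiles of an arbitrarily large GEMM with partial-sum accumulation, so hardware usage stays constant as the problem grows. The tile shape and the names \texttt{fixed\_block\_matmul} and \texttt{temporal\_gemm} are hypothetical and are not taken from the Tempus implementation.

\begin{lstlisting}[language=C++]
// Illustrative sketch only: temporal GEMM scaling over a fixed compute block,
// modeled in plain C++. Tile sizes and function names are assumptions and do
// not reflect Tempus's actual AIE-ML graph or PL data movers.
#include <cstddef>
#include <vector>

// Tile shape handled by the fixed compute block per invocation (assumed).
constexpr std::size_t TM = 32, TK = 32, TN = 32;

// Stand-in for one pass of the fixed block: C_tile += A_tile * B_tile.
void fixed_block_matmul(const float* A, const float* B, float* C,
                        std::size_t lda, std::size_t ldb, std::size_t ldc) {
    for (std::size_t i = 0; i < TM; ++i)
        for (std::size_t k = 0; k < TK; ++k)
            for (std::size_t j = 0; j < TN; ++j)
                C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
}

// Temporal scaling: the same fixed block is invoked iteratively over every
// tile of the larger problem instead of instantiating more hardware.
// Assumes row-major A (MxK), B (KxN), zero-initialized C (MxN), and
// dimensions that are multiples of the tile sizes.
void temporal_gemm(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, std::size_t M, std::size_t K,
                   std::size_t N) {
    for (std::size_t i = 0; i < M; i += TM)
        for (std::size_t j = 0; j < N; j += TN)
            for (std::size_t k = 0; k < K; k += TK)  // partial-sum accumulation
                fixed_block_matmul(&A[i * K + k], &B[k * N + j],
                                   &C[i * N + j], K, N, N);
}
\end{lstlisting}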