Tempus: A Temporally Scalable Resource-Invariant GEMM Streaming Framework for Versal AI Edge

Scaling laws for Large Language Models (LLMs) establish that model quality improves with computational scale, yet edge deployment imposes strict constraints on compute, memory, and power. Since General Matrix Multiplication (GEMM) accounts for up to 90% of inference time, efficient GEMM acceleration is critical for edge AI. The Adaptive Intelligent Engines (AIEs) in AMD Versal adaptive SoCs are well suited to this task, but existing state-of-the-art (SOTA) frameworks maximize performance through spatial scaling, distributing workloads across hundreds of cores. That approach breaks down on resource-limited edge SoCs because of physical implementation failures, bandwidth saturation, and excessive resource consumption. We propose Tempus, a Resource-Invariant Temporal GEMM framework for the AMD Versal AI Edge SoC. Rather than growing hardware resources with matrix size, Tempus employs a fixed compute block of 16 AIE-ML cores and scales through iterative graph execution, with algorithmic data tiling and replication performed in the Programmable Logic (PL). High-speed cascade streaming provides low-latency partial-sum reduction at an Initiation Interval (II) of 1, while a deadlock-free DATAFLOW protocol maximizes transfer-compute overlap and PLIO reuse. Evaluated on GEMM workloads, Tempus achieves 607 GOPS at 10.677 W total on-chip power. Characterizing system-level efficiency with the Platform-Aware Utility (PAU) metric, we show that Tempus achieves a 211.2x higher prominence factor than the leading spatial SOTA framework (ARIES). Furthermore, the framework uses no URAM or DSP resources (0.00% utilization) and yields 22.0x core frugality, 7.1x power frugality, and a 6.3x reduction in I/O demand, establishing a sustainable, scalable foundation for edge LLM inference.
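The temporal-scaling idea in the abstract can be illustrated with a small software model. The sketch below is not the Tempus implementation: it is plain Python, the tile sizes (`TILE_M`, `TILE_K`, `TILE_N`) and all function names are hypothetical, and the cascade is modelled simply as a chain of 16 stages that each add their product onto the incoming partial sum. It only shows how a fixed-size block, invoked repeatedly, can cover matrices of any compatible size without adding compute units.

```python
# Illustrative model only: tile sizes and function names are assumptions,
# not taken from the Tempus paper.
NUM_CORES = 16                      # fixed compute block size stated in the abstract
TILE_M, TILE_K, TILE_N = 4, 2, 4    # hypothetical per-core tile shape


def core_mac(acc, a_tile, b_tile):
    """One core: add a_tile @ b_tile onto the incoming cascade partial sum."""
    rows, inner, cols = len(a_tile), len(b_tile), len(b_tile[0])
    return [[acc[i][j] + sum(a_tile[i][k] * b_tile[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]


def block_invoke(a_panel, b_panel):
    """One invocation of the fixed block: NUM_CORES cores chained along K."""
    acc = [[0] * len(b_panel[0]) for _ in a_panel]
    for c in range(NUM_CORES):
        k0 = c * TILE_K
        a_tile = [row[k0:k0 + TILE_K] for row in a_panel]
        b_tile = b_panel[k0:k0 + TILE_K]
        acc = core_mac(acc, a_tile, b_tile)   # partial sum passed down the chain
    return acc


def temporal_gemm(a, b):
    """Compute a @ b by re-invoking the same block; returns (c, invocations)."""
    m, k, n = len(a), len(b), len(b[0])
    block_k = NUM_CORES * TILE_K
    assert m % TILE_M == 0 and n % TILE_N == 0 and k % block_k == 0
    c = [[0] * n for _ in range(m)]
    invocations = 0
    for i0 in range(0, m, TILE_M):            # tiling loops stand in for the PL-side tiler
        for j0 in range(0, n, TILE_N):
            for k0 in range(0, k, block_k):
                a_panel = [row[k0:k0 + block_k] for row in a[i0:i0 + TILE_M]]
                b_panel = [row[j0:j0 + TILE_N] for row in b[k0:k0 + block_k]]
                part = block_invoke(a_panel, b_panel)
                for i in range(TILE_M):
                    for j in range(TILE_N):
                        c[i0 + i][j0 + j] += part[i][j]
                invocations += 1
    return c, invocations
```

With these assumed tile sizes, an 8x64 by 64x8 product needs 2 x 2 x 2 = 8 invocations of the same 16-stage block. Larger matrices raise only the invocation count, not the number of cores, which is the resource-invariance property the abstract describes.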

