The performance of AI accelerators is increasingly limited by data movement, memory access, and orchestration overheads rather than by raw compute capability. This paper presents MAVeC, a messaging-based adaptive vector computing accelerator designed to support streaming execution and runtime configurability for AI workloads. MAVeC replaces centralized control with a message-driven execution model in which data and control propagate together across distributed hardware elements, enabling autonomous execution, flexible routing, and efficient coordination. We validate MAVeC's core hardware constructs and execution model on matrix multiplication and convolution workloads using a cycle-accurate, system-level ASIC design in TSMC 28 nm, capturing computation, communication, and reduction. MAVeC sustains greater than 97 percent array utilization across hardware scales and problem sizes by translating spatial capacity into effective computation. Once inputs are brought on-chip, over 90 percent of communication remains on-chip through coordinated temporal reuse, spatial multicast, and on-fabric partial-sum reduction. On a 64x64 SiteO array, MAVeC sustains over 5 TFLOPS while reducing end-to-end latency. Compared to TPU-style systolic arrays and MEISSA under compute-centric models, MAVeC achieves 1.5-2x lower latency. Against optimized NVIDIA H100 FP32 kernels, MAVeC sustains 5.8-6.1 TFLOPS, delivering a consistent 6.0-7.2x throughput advantage across problem sizes. Energy results show that MAVeC converts higher instantaneous power into lower total energy by shortening execution time and amortizing data movement. These results demonstrate that message-driven execution provides an effective architectural foundation for overcoming data movement and orchestration bottlenecks, enabling scalable, high-utilization accelerators for future AI workloads.