Accelerating Precise End-to-End Simulation: Latency-Sensitive Many-core System Modeling

Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate exploring and designing many-core accelerators with tightly coupled scratchpad memory (SPM) for scalable compute and predictable, explicitly managed data access. However, this architectural shift raises two challenges: cycle-accurate register-transfer-level (RTL) simulation becomes prohibitively slow as system complexity grows, and performance estimation requires precise modeling of latency-sensitive interconnect behavior. This paper presents a fast yet accurate end-to-end modeling approach for latency-sensitive many-core architectures, targeting large-scale instances such as TeraNoC with 1024 cores and a 4 MiB globally shared L1 SPM. The approach captures the timing behavior of latency-sensitive SPM accesses across multiple interconnect scales, while abstracting away non-essential hardware details. Across diverse benchmarks, the model tracks a cycle-accurate RTL golden model with errors below 7%, while delivering up to 115x faster simulation. The framework also provides detailed profiling across processing elements and the interconnect, enabling efficient end-to-end software development and hardware design exploration. Two case studies demonstrate its practicality: profiling-guided optimization of FlashAttention-2 to reduce interconnect stalls and synchronization overhead, and design space exploration of network-on-chip (NoC) router remapping to alleviate traffic imbalance and improve throughput.
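The abstract reports accuracy as an error against the RTL golden model and speed as a simulation speedup, but it does not define either metric. The sketch below shows one plausible reading, assuming the error is the absolute relative difference in cycle counts and the speedup is the ratio of wall-clock simulation times. The function names, benchmark names, and all numbers are hypothetical placeholders for illustration, not results from the paper.

```python
def relative_error(model_cycles: int, rtl_cycles: int) -> float:
    """Absolute relative error of the model's cycle count against the RTL count.

    Assumption: this is one common definition; the abstract does not state
    which error metric the authors use.
    """
    return abs(model_cycles - rtl_cycles) / rtl_cycles


def speedup(rtl_wall_s: float, model_wall_s: float) -> float:
    """Ratio of RTL simulation wall-clock time to model wall-clock time."""
    return rtl_wall_s / model_wall_s


def summarize(runs: dict) -> tuple:
    """Return (worst-case relative error, best-case speedup) over all runs."""
    worst_error = max(
        relative_error(r["model_cycles"], r["rtl_cycles"]) for r in runs.values()
    )
    best_speedup = max(
        speedup(r["rtl_wall_s"], r["model_wall_s"]) for r in runs.values()
    )
    return worst_error, best_speedup


# Hypothetical placeholder measurements, NOT data from the paper.
example_runs = {
    "kernel_a": {"rtl_cycles": 10_000, "model_cycles": 10_500,
                 "rtl_wall_s": 200.0, "model_wall_s": 4.0},
    "kernel_b": {"rtl_cycles": 20_000, "model_cycles": 19_000,
                 "rtl_wall_s": 90.0, "model_wall_s": 3.0},
}

worst_error, best_speedup = summarize(example_runs)
print(f"worst relative error: {worst_error:.1%}")
print(f"best speedup: {best_speedup:.0f}x")
```

Taking the maximum error and the maximum speedup mirrors the phrasing "errors below 7%" and "up to 115x": the first is a bound that every benchmark must satisfy, while the second is a best case that only one benchmark needs to reach.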
