Modern large language model workloads put increasing demands on parallel compute capability and on-chip memory capacity, while also stressing fine-grained data movement and synchronization. These trends motivate the exploration and design of many-core accelerators with tightly coupled scratchpad memory (SPM) for scalable compute and predictable, explicitly managed data access. However, this architectural shift raises two challenges: cycle-accurate register-transfer-level (RTL) simulation becomes prohibitively slow as system complexity grows, and performance estimation requires precise modeling of latency-sensitive interconnect behavior. This paper presents a fast yet accurate end-to-end modeling approach for latency-sensitive many-core architectures, targeting large-scale instances such as TeraNoC with 1024 cores and a 4 MiB globally shared L1 SPM. The approach captures the timing behavior of latency-sensitive SPM accesses across multiple interconnect scales, while abstracting away non-essential hardware details. Across diverse benchmarks, the model tracks a cycle-accurate RTL golden model with errors below 7%, while delivering up to 115x faster simulation. The framework also provides detailed profiling across processing elements and the interconnect, enabling efficient end-to-end software development and hardware design exploration. Two case studies demonstrate its practicality: profiling-guided optimization of FlashAttention-2 to reduce interconnect stalls and synchronization overhead, and design space exploration of network-on-chip (NoC) router remapping to alleviate traffic imbalance and improve throughput.