Cycle-accurate simulators are widely used to study systolic accelerators, yet their accuracy and usability are often limited by weak validation against real hardware and poor integration with modern ML compiler stacks. This paper presents SCALE-Sim TPU, a validated and extended version of SCALE-Sim v3 for TPU-style accelerators. Specifically, we make three contributions: (1) We validate SCALE-Sim's systolic GEMM model against measurements on Google TPU v4 and show that simulated cycle counts exhibit a strong linear correlation with hardware latency, enabling a simple cycle-to-latency mapping. (2) We introduce lightweight learned latency models for non-systolic elementwise operations, achieving median relative errors below 3 percent using only tensor size and shape, substantially improving end-to-end latency estimation. (3) We integrate a StableHLO-based frontend that allows workloads from modern ML frameworks such as JAX and PyTorch to be simulated directly via a unified compiler IR. Together, these contributions improve the fidelity, coverage, and practicality of cycle-accurate simulation for whole-model performance analysis on TPUs.
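As an illustration of the cycle-to-latency mapping in contribution (1), the following minimal sketch fits a line from simulated cycle counts to measured latencies by ordinary least squares. It assumes the mapping is a plain affine fit; the abstract states only that the correlation is strongly linear. The function names `fit_cycle_to_latency` and `predict_latency_us` are hypothetical, and the sample values are synthetic placeholders, not TPU v4 measurements.

```python
from statistics import mean


def fit_cycle_to_latency(cycles, latencies_us):
    """Least-squares fit of latency_us ~= slope * cycles + intercept."""
    if len(cycles) != len(latencies_us) or len(cycles) < 2:
        raise ValueError("need at least two paired samples")
    c_bar, t_bar = mean(cycles), mean(latencies_us)
    # Sum of squared deviations of the simulated cycle counts.
    sxx = sum((c - c_bar) ** 2 for c in cycles)
    if sxx == 0:
        raise ValueError("cycle counts must not all be equal")
    # Co-deviation of cycle counts and measured latencies.
    sxy = sum((c - c_bar) * (t - t_bar) for c, t in zip(cycles, latencies_us))
    slope = sxy / sxx
    intercept = t_bar - slope * c_bar
    return slope, intercept


def predict_latency_us(cycles, slope, intercept):
    """Map a simulated cycle count to an estimated latency in microseconds."""
    return slope * cycles + intercept


# Synthetic placeholder data (assumption): constructed to lie exactly on a
# line so the fit is easy to check by hand. These are NOT hardware results.
sim_cycles = [1000, 2000, 4000, 8000]
measured_us = [3.0, 5.0, 9.0, 17.0]

slope, intercept = fit_cycle_to_latency(sim_cycles, measured_us)
estimate_us = predict_latency_us(16000, slope, intercept)
```

In this sketch the slope plays the role of an effective time per simulated cycle and the intercept absorbs any fixed per-invocation overhead; whether the actual mapping includes an intercept is not stated in the abstract.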