ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Architectural simulation has become the critical bottleneck limiting design space exploration for high-performance computing systems. Modern GPUs and AI accelerators -- with hundreds to thousands of tightly coupled components -- demand simulation frameworks that deliver efficient parallelism and scalable single-node execution. Existing frameworks fall short: SST focuses on multi-node MPI scalability but struggles with intra-node scaling, while GPGPU-Sim remains largely single-threaded. Critically, neither exposes a mechanism for users to optimize threading for their specific workloads. We introduce ACALSim, a scalable parallel simulation framework that provides infrastructure and APIs for building high-performance simulators; timing-model accuracy remains the responsibility of simulator developers. Its key innovation is a pluggable thread-management architecture that lets developers implement custom scheduling strategies tailored to specific simulation patterns, a capability absent from existing frameworks. Complementing it are (1) event-driven execution with fast-forward to eliminate idle-cycle overhead, (2) a shared-memory data model enabling zero-copy communication, and (3) a two-phase parallel execution model for deterministic thread scaling. We demonstrate ACALSim through HPCSim, a GPU simulator targeting A100-class architectures. Against an SST implementation that uses identical shared timing cores to isolate framework overhead, ACALSim achieves over 14x speedup with a 41% lower memory footprint; hardware validation confirms 0.72--1.22x cycle-count correlation with A100 measurements. While SST fails to complete workloads of 256 or more thread blocks within practical time limits, ACALSim simulates a full LLaMA transformer layer (single block) in 17.7 minutes for LLaMA-7B and 30.4 minutes for LLaMA-13B -- enabling design space exploration that SST cannot achieve.
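
The three mechanisms named above (pluggable thread management, event-driven fast-forward, and two-phase execution) can be illustrated with a small sketch. This is not ACALSim's API: every name here (`Component`, `PingPong`, `SerialThreadManager`, `PooledThreadManager`, `simulate`, `build_ping_pong`) is hypothetical, and the split into a parallel compute phase followed by a single-threaded commit phase is an assumption about what a two-phase model could look like, written in Python for brevity.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


class Component:
    """Hypothetical simulated unit with a private event heap of (tick, payload)."""

    def __init__(self, name):
        self.name = name
        self.events = []   # min-heap ordered by tick
        self.outbox = []   # (dest_name, tick, payload) produced during phase 1
        self.log = []      # (tick, payload) pairs this component has handled

    def next_tick(self):
        return self.events[0][0] if self.events else None

    def step(self, now):
        # Phase 1: runs in parallel, so it touches only this component's state.
        while self.events and self.events[0][0] == now:
            _, payload = heapq.heappop(self.events)
            self.log.append((now, payload))
            self.handle(now, payload)

    def handle(self, now, payload):
        pass


class PingPong(Component):
    """Forwards a decreasing hop counter to its peer after a fixed latency."""

    def __init__(self, name, peer, latency):
        super().__init__(name)
        self.peer, self.latency = peer, latency

    def handle(self, now, payload):
        if payload > 0:
            self.outbox.append((self.peer, now + self.latency, payload - 1))


class SerialThreadManager:
    """Pluggable strategy: step every component on the calling thread."""

    def run_phase1(self, components, now):
        for c in components:
            c.step(now)

    def close(self):
        pass


class PooledThreadManager:
    """Pluggable strategy: step components concurrently on a thread pool."""

    def __init__(self, workers):
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def run_phase1(self, components, now):
        # Consuming the map result waits for all workers (acts as a barrier).
        list(self.pool.map(lambda c: c.step(now), components))

    def close(self):
        self.pool.shutdown()


def simulate(components, manager):
    by_name = {c.name: c for c in components}
    now, iterations = 0, 0
    while True:
        pending = [t for t in (c.next_tick() for c in components) if t is not None]
        if not pending:
            break
        now = min(pending)                    # fast-forward over idle cycles
        manager.run_phase1(components, now)   # phase 1: parallel compute
        for c in components:                  # phase 2: ordered, single-threaded commit
            for dest, tick, payload in c.outbox:
                heapq.heappush(by_name[dest].events, (tick, payload))
            c.outbox.clear()
        iterations += 1
    return now, iterations


def build_ping_pong(hops, latency):
    a, b = PingPong("a", "b", latency), PingPong("b", "a", latency)
    heapq.heappush(a.events, (0, hops))
    return [a, b]
```

Calling `simulate(build_ping_pong(hops=3, latency=100), SerialThreadManager())` finishes at tick 300 after four loop iterations, rather than ticking through every cycle in between. Because cross-component messages are delivered only in the ordered commit phase, swapping in `PooledThreadManager` leaves each component's log unchanged in this sketch, which is the kind of property a deterministic two-phase design aims for.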
