The increasing complexity and diversity of hardware accelerators in modern computing systems demand flexible, low-overhead program analysis tools. We present PASTA, a low-overhead and modular Program AnalysiS Tool Framework for Accelerators. PASTA abstracts over low-level profiling APIs and diverse deep learning frameworks, offering users a unified interface for capturing and analyzing runtime events at multiple levels. Its extensible design lets researchers and practitioners rapidly prototype custom tools with minimal overhead. We demonstrate the utility of PASTA by building several analysis tools on top of it, including a deep learning workload characterization tool and a UVM optimization tool. An extensive evaluation on mainstream deep learning workloads, run on NVIDIA and AMD GPUs in both single- and multi-GPU scenarios, shows PASTA's broad applicability. On NVIDIA GPUs, we further show that PASTA's GPU-accelerated backend provides detailed performance insights with significantly lower overhead, running up to 1.3 × 10^4 times faster than conventional analysis tools. PASTA strikes a practical balance between usability, extensibility, and efficiency, making it well suited for modern accelerator-based computing environments.