Effective performance profiling and analysis are essential for optimizing the training and inference of deep learning models, especially given the growing complexity of heterogeneous computing environments. However, existing tools often cannot provide comprehensive program context or performance optimization insights for the sophisticated interactions between CPUs and GPUs. This paper introduces DeepContext, a novel profiler that links program contexts across high-level Python code, deep learning frameworks, underlying libraries written in C/C++, and device code executed on GPUs. DeepContext measures both coarse- and fine-grained performance metrics for major deep learning frameworks, such as PyTorch and JAX, and is compatible with GPUs from both Nvidia and AMD, as well as various CPU architectures, including x86 and ARM. In addition, DeepContext integrates a novel GUI that allows users to quickly identify hotspots and an innovative automated performance analyzer that suggests potential optimizations to users based on performance metrics and program context. Through detailed use cases, we demonstrate how DeepContext helps users identify and analyze performance issues, enabling quick and effective optimization of deep learning workloads. We believe DeepContext is a valuable tool for users seeking to optimize complex deep learning workflows across multiple compute environments.