Deep learning kernels exhibit predictable memory access and compute patterns, making GPUs' parallel architecture well suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends, or multilayer perceptrons, to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from high percentage errors when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems and bounds each prediction with fundamental performance laws. NeuSight splits the prediction for a single deep learning kernel into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are made with a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. Compared to state-of-the-art prior work, it reduces the percentage error from 121.4% and 30.8% to 2.3% when predicting the training and inference latency of the GPT3 model on the H100, where neither GPT3 nor the H100 was used to train the framework.
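The tile-based prediction flow described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the tile sizes, the H100 peak numbers, and the fixed `predicted_util` placeholder (which stands in for the learned per-tile utilization model) are all assumptions made for the example.

```python
# Hypothetical sketch of tile-granularity latency prediction for a GEMM kernel:
# split the kernel into tiles, bound each tile's latency with roofline-style
# performance laws, and aggregate over waves of tiles across the SMs.
import math

def roofline_floor(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on latency (s) from compute and memory-bandwidth limits."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

def predict_kernel_latency(m, n, k, tile_m=128, tile_n=128,
                           num_sms=132, peak_flops=989e12, peak_bw=3.35e12,
                           predicted_util=0.6):
    """Estimate latency of an (m x k) @ (k x n) GEMM at tile granularity.

    num_sms / peak_flops / peak_bw are assumed H100-like figures;
    predicted_util is a fixed placeholder for the ML model's per-tile
    utilization prediction.
    """
    num_tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    # Per-tile work: FLOPs and approximate bytes moved (fp16 operands).
    tile_flops = 2 * tile_m * tile_n * k
    tile_bytes = 2 * (tile_m * k + k * tile_n + tile_m * tile_n)
    # Bound the learned prediction by the performance-law floor (per SM).
    floor = roofline_floor(tile_flops, tile_bytes,
                           peak_flops / num_sms, peak_bw / num_sms)
    tile_latency = floor / predicted_util
    # Tiles execute independently in waves across the SMs; waves are serial.
    waves = math.ceil(num_tiles / num_sms)
    return waves * tile_latency
```

The key property the sketch captures is that the learned model can only scale the prediction above a physically grounded floor, which keeps extrapolation to unseen GPUs bounded.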