Precise hardware performance models play a crucial role in code optimization. They can assist compilers in making heuristic decisions or help autotuners identify the optimal configuration for a given program. For example, the autotuner for XLA, a machine learning compiler, discovered 10-20% speedups on state-of-the-art models serving substantial production traffic at Google. Although a few datasets for program performance prediction exist, they target small sub-programs such as basic blocks or kernels. This paper introduces TpuGraphs, a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with that configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures, e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer. TpuGraphs provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and its graphs are on average 770x larger than those in existing performance prediction datasets on machine learning programs. This graph-level prediction task on large graphs introduces new challenges in learning, ranging from scalability and training efficiency to model quality.
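To make the shape of a data sample concrete, the following is a minimal sketch of the (graph, configuration, execution time) triple described in the abstract. All class names, field names, and the `tile_size` option are hypothetical and chosen for illustration; they are not the dataset's actual schema, and the numeric values are toy placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Graph:
    """A computational graph: one opcode per node plus directed edges."""
    opcodes: List[str]
    edges: List[Tuple[int, int]]  # (source node index, destination node index)


@dataclass
class Sample:
    """One (graph, configuration, execution time) triple."""
    graph: Graph
    config: Dict[str, int]  # hypothetical compiler options
    runtime_sec: float      # execution time under this configuration


def fastest(samples: List[Sample]) -> Sample:
    """Return the sample whose configuration has the lowest execution time."""
    if not samples:
        raise ValueError("no samples")
    return min(samples, key=lambda s: s.runtime_sec)


# Toy placeholder values, not taken from the dataset.
toy_graph = Graph(
    opcodes=["parameter", "convolution", "add"],
    edges=[(0, 1), (1, 2)],
)
samples = [
    Sample(toy_graph, {"tile_size": 8}, 3.0),
    Sample(toy_graph, {"tile_size": 16}, 2.0),
    Sample(toy_graph, {"tile_size": 32}, 2.5),
]
best = fastest(samples)
```

Here `fastest` selects among measured execution times; a learned performance model would presumably stand in for those measurements by predicting `runtime_sec` from the graph and configuration, so that an autotuner can compare candidate configurations without running each one on hardware.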