Building and maintaining large AI fleets to efficiently support fast-growing deep learning (DL) workloads is an active research topic for modern cloud infrastructure providers. Generating accurate benchmarks plays an essential role in the design and evaluation of rapidly evolving software and hardware solutions in this area. Two fundamental challenges in making this process scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution graph (EG), a new feature that captures the runtime information of AI models at operator granularity, in a graph format, together with operator metadata. By sourcing EG traces from the fleet, we can build AI benchmarks that are portable and representative. Mystique is scalable because its data collection is lightweight in terms of both runtime overhead and user instrumentation effort. It is also adaptive, as the expressiveness and composability of the EG format allow flexible user control over benchmark creation. We evaluate our methodology on several production AI workloads and show that benchmarks generated with Mystique closely resemble the original AI models, both in execution time and in system-level metrics. We also showcase the portability of the generated benchmarks across platforms and demonstrate several use cases enabled by the fine-grained composability of the execution graph.
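To make the idea of replaying an operator-level graph more concrete, the following is a minimal toy sketch in plain Python. It is not the PyTorch EG schema and not Mystique's replay mechanism: the names `OpNode`, `REGISTRY`, and `replay`, and the stand-in "kernels", are hypothetical and exist only for illustration. The sketch assumes each recorded operator carries a name, its dependencies, and input-shape metadata, and that a benchmark can be composed by replaying all or a selected subset of operators in dependency order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+
import time


@dataclass
class OpNode:
    """One recorded operator (hypothetical schema, not the real EG format)."""
    node_id: int
    op: str
    deps: list = field(default_factory=list)          # ids that must run first
    input_shapes: list = field(default_factory=list)  # metadata only, no tensor data


# Hypothetical stand-ins for real operator kernels: each rebuilds inputs
# from the recorded shapes and does work proportional to their size.
def _fake_matmul(shapes):
    (m, k), (_, n) = shapes
    a = [[1.0] * k for _ in range(m)]
    b = [[1.0] * n for _ in range(k)]
    return [[sum(a[i][x] * b[x][j] for x in range(k)) for j in range(n)]
            for i in range(m)]


def _fake_relu(shapes):
    (m, n), = shapes
    return [[max(0.0, 1.0) for _ in range(n)] for _ in range(m)]


REGISTRY = {"matmul": _fake_matmul, "relu": _fake_relu}


def replay(nodes, keep=lambda node: True):
    """Replay selected operators in dependency order.

    Returns the list of replayed operator names and per-node wall-clock times.
    """
    by_id = {n.node_id: n for n in nodes}
    sorter = TopologicalSorter({n.node_id: n.deps for n in nodes})
    order, timings = [], {}
    for node_id in sorter.static_order():
        node = by_id[node_id]
        if not keep(node):
            continue  # composability: skip operators the user filtered out
        start = time.perf_counter()
        REGISTRY[node.op](node.input_shapes)
        timings[node_id] = time.perf_counter() - start
        order.append(node.op)
    return order, timings


# A three-operator toy trace: matmul -> relu -> matmul.
trace = [
    OpNode(0, "matmul", [], [(4, 8), (8, 16)]),
    OpNode(1, "relu", [0], [(4, 16)]),
    OpNode(2, "matmul", [1], [(4, 16), (16, 2)]),
]

full_order, full_timings = replay(trace)
matmul_only, _ = replay(trace, keep=lambda n: n.op == "matmul")
```

Because the toy trace stores only shapes and dependencies, the replay needs no model code or input data. The `keep` filter hints at how operator-level granularity could let a user carve out a sub-benchmark, here one containing only the `matmul` operators.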