Building large AI fleets to support rapidly growing deep learning (DL) workloads is an active research topic for modern cloud providers. Accurate benchmarks are essential for designing the fast-evolving software and hardware solutions in this space. Two fundamental challenges in making benchmark generation scalable are (i) workload representativeness and (ii) the ability to quickly incorporate changes to the fleet into the benchmarks. To overcome these issues, we propose Mystique, an accurate and scalable framework for production AI benchmark generation. It leverages the PyTorch execution trace (ET), a new feature that captures the runtime information of AI models at the granularity of operators, in a graph format, together with their metadata. By sourcing ETs from the fleet, we can build AI benchmarks that are portable and representative. Mystique is scalable because its data collection is lightweight in terms of both runtime overhead and instrumentation effort. It is also adaptive because ET composability allows flexible control over benchmark creation. We evaluate our methodology on several production AI models and show that benchmarks generated with Mystique closely resemble the original AI models, both in execution time and in system-level metrics. We also showcase the portability of the generated benchmarks across platforms and demonstrate several use cases enabled by the fine-grained composability of the execution trace.
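To make the idea of an operator-level trace graph and its composability more concrete, the following is a minimal sketch in plain Python. It is not Mystique's implementation and does not use the real PyTorch ET schema: the `OpNode` fields, the `TRACE` contents, and the helpers `select_subtree` and `replay_plan` are all hypothetical names chosen for illustration. It assumes each node records its parent and that parents appear before their children in the trace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OpNode:
    """One operator in a simplified, hypothetical execution trace."""
    node_id: int
    name: str
    parent: Optional[int]  # id of the enclosing node; None for the root
    input_shapes: List[List[int]] = field(default_factory=list)  # metadata


def select_subtree(nodes: List[OpNode], root_id: int) -> List[OpNode]:
    """Return the nodes under root_id (inclusive), in execution order.

    Assumes parents are listed before their children.
    """
    keep = {root_id}
    for node in nodes:
        if node.parent in keep:
            keep.add(node.node_id)
    return [node for node in nodes if node.node_id in keep]


def replay_plan(nodes: List[OpNode], root_id: int) -> List[str]:
    """List the operators directly under root_id, i.e. the ones a
    benchmark built from this part of the trace would replay."""
    return [node.name for node in nodes if node.parent == root_id]


# Hypothetical trace: a root with a forward and a backward region.
TRACE = [
    OpNode(1, "[root]", None),
    OpNode(2, "forward", 1),
    OpNode(3, "aten::linear", 2, [[8, 16], [4, 16]]),
    OpNode(4, "aten::addmm", 3, [[4], [8, 16], [16, 4]]),
    OpNode(5, "aten::relu", 2, [[8, 4]]),
    OpNode(6, "backward", 1),
    OpNode(7, "aten::mm", 6, [[8, 4], [4, 16]]),
]

# Compose a benchmark from the forward region only.
forward_nodes = select_subtree(TRACE, 2)
forward_ops = replay_plan(TRACE, 2)
```

Because each node carries its own metadata, a sub-graph such as the forward region can be selected without referring back to the original model code, which is the kind of fine-grained control the abstract attributes to ET composability.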