The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observing, reproducing, and optimizing the behavior of distributed machine learning (ML) workloads in production AI systems, and for enabling efficient software-hardware (SW-HW) co-design of future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called the Chakra execution trace (ET). ETs represent key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra ETs collected on production AI clusters and demonstrate their value through real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
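To make the idea of a graph-based execution trace concrete, the following is a minimal Python sketch. It is not the Chakra ET schema: the `Node` class, its field names, the example operation names, and the durations are all hypothetical, chosen only for illustration. The sketch shows how operations with data and control dependencies and recorded timing can be walked in dependency order to estimate a critical path, assuming unlimited resources (that is, ignoring the resource constraints a real trace would carry).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    """One operation in a hypothetical execution-trace graph (not the Chakra schema)."""
    node_id: int
    name: str
    kind: str            # "compute", "memory", or "communication"
    duration_us: float   # recorded runtime of the operation, in microseconds
    data_deps: list[int] = field(default_factory=list)   # producers of this node's inputs
    ctrl_deps: list[int] = field(default_factory=list)   # ordering-only predecessors


def earliest_finish_times(nodes):
    """Return each node's earliest finish time, assuming unlimited resources."""
    by_id = {n.node_id: n for n in nodes}
    # Map every node to the set of all its predecessors (data and control).
    graph = {n.node_id: set(n.data_deps) | set(n.ctrl_deps) for n in nodes}
    finish = {}
    # static_order() yields predecessors before the nodes that depend on them.
    for nid in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[nid]), default=0.0)
        finish[nid] = start + by_id[nid].duration_us
    return finish


# A made-up single training step on one rank.
trace = [
    Node(0, "load_batch", "memory", 5.0),
    Node(1, "forward", "compute", 40.0, data_deps=[0]),
    Node(2, "backward", "compute", 80.0, data_deps=[1]),
    Node(3, "all_reduce_grads", "communication", 30.0, data_deps=[2]),
    Node(4, "optimizer_step", "compute", 10.0, data_deps=[3], ctrl_deps=[2]),
]

finish = earliest_finish_times(trace)
critical_path_us = max(finish.values())
```

A simulator or replay tool consuming such a graph would additionally model resource contention and overlap between compute and communication, which this sketch deliberately leaves out.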