The fast pace of artificial intelligence (AI) innovation demands an agile methodology for observing, reproducing, and optimizing the behavior of distributed machine learning (ML) workloads in production AI systems, and for enabling efficient software-hardware (SW-HW) co-design of future systems. We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called the Chakra execution trace (ET). ETs capture key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Chakra also includes a complementary set of tools and capabilities that enable the collection, analysis, and generation of Chakra ETs and their adoption by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra ETs collected on production AI clusters and demonstrate their value through real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement from across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
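To make the idea of a graph-based execution trace concrete, the following is a minimal illustrative sketch, not Chakra's actual schema or API. The `Node` class, its field names, and the helper functions `replay_order` and `critical_path_us` are all hypothetical. The sketch assumes each node records an operation kind, a measured duration, and the IDs of the nodes it depends on, and it shows how a replay tool might derive a valid execution order and a dependency-only lower bound on runtime (resource constraints are ignored here).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class Node:
    """Hypothetical trace node; field names are assumptions, not Chakra's schema."""
    node_id: int
    name: str
    kind: str            # "compute", "memory", or "communication"
    duration_us: float   # measured duration in microseconds
    deps: list[int] = field(default_factory=list)  # IDs of predecessor nodes


def replay_order(nodes):
    """Return node IDs in an order that respects every dependency edge."""
    sorter = TopologicalSorter({n.node_id: n.deps for n in nodes})
    return list(sorter.static_order())


def critical_path_us(nodes):
    """Longest dependency chain: a lower bound on runtime with unlimited resources."""
    by_id = {n.node_id: n for n in nodes}
    finish = {}
    for nid in replay_order(nodes):
        node = by_id[nid]
        start = max((finish[d] for d in node.deps), default=0.0)
        finish[nid] = start + node.duration_us
    return max(finish.values(), default=0.0)


# A toy trace with made-up operations and durations, for illustration only.
trace = [
    Node(0, "load_weights", "memory", 5.0),
    Node(1, "matmul_fwd", "compute", 20.0, deps=[0]),
    Node(2, "all_reduce_grads", "communication", 15.0, deps=[1]),
    Node(3, "optimizer_prefetch", "memory", 8.0, deps=[0]),
    Node(4, "optimizer_step", "compute", 4.0, deps=[2, 3]),
]

print(replay_order(trace))
print(critical_path_us(trace))  # 44.0 (chain 0 -> 1 -> 2 -> 4)
```

Separating the graph (what depends on what) from the timing annotations is what would let the same trace drive different consumers: a simulator can substitute its own modeled durations, while a replay tool can reuse the recorded ones.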