Design space exploration for future distributed machine learning systems suffers from the lack of a readily available workload representation that enables flexible exploration across the stack. We present Flint, a framework that bridges this gap by leveraging the intermediate representation (IR) of machine learning framework compilers. The compiler does the heavy lifting of understanding and preserving the behavior of the original model code. Because Flint interfaces with the compiler before hardware execution, it can collect the workload representation for arbitrary cluster sizes. We validate the workload graph against post-execution traces and demonstrate the flexibility of Flint through a design space exploration case study.