ASTRA-sim 3.0: Next-Level Distributed Machine Learning Simulations via High-Fidelity GPU and Infrastructure Modeling

William Won, Jinsun Yoo, Tuan Ta, Moumita Dey, Andy Balogh, Pradosh Datta, Furkan Eris, Conor Green, Winston Liu, Changhai Man, Kingshuk Mandal, Amos Rai, Vinay Ramakrishnaiah, Ruchi Shah, David Sidler, Harsh Sikhwal, Hanjiang Wu, Tushar Krishna, Bradford M. Beckmann

From arXiv: 10 pages, 15 figures, 1 table

Distributed machine learning (ML) is a key paradigm for today's large-scale artificial intelligence applications. As model inference emerges as an important use case, faithful modeling of latency-sensitive collective communication has never been more important. Capturing the device architecture and modeling control and data paths at high fidelity is therefore a necessity. A common, detailed representation of distributed ML infrastructure is also crucial. We revisit ASTRA-sim, a promising open-source, community-driven simulator. In this work, we identify limitations of the current ASTRA-sim simulator and augment it with new features that enable fine-grained, high-fidelity simulation with a standardized infrastructure representation, opening new opportunities for design space exploration. We propose simulating at cache-line-sized load-store granularity, with a detailed graphics processing unit (GPU) execution model, to balance simulation scalability and fidelity. We also introduce InfraGraph, a standardized representation that captures distributed ML network infrastructure in detail. Using the updated ASTRA-sim 3.0 simulator, we showcase design space explorations covering optimized collective algorithms, network requirements, and GPU architectures.

