Large-scale distributed training has become an active research area in machine learning systems in recent years, in both industry and academia. However, conducting experiments is difficult without physical machines and the corresponding resources. One solution is to use distributed training simulators, but current ones such as ASTRA-sim do not support importing models developed in the real world, which makes them hard for ML researchers to use. To address this challenge, we developed ModTrans, a translator that converts any real-world model into the input format of the ASTRA-sim simulator, removing the barrier between machine learning experts and machine learning systems researchers. Experimental results show that the cost of ModTrans is negligible.