We propose NeuronaBox, a flexible, user-friendly, and high-fidelity approach to emulating DNN training workloads. We argue that performance can be observed accurately by executing the training workload on a subset of real nodes while emulating the networked execution environment and the collective communication operations. Initial results from a proof-of-concept implementation show that NeuronaBox closely replicates the behavior of actual systems, with an error of less than 1% between emulated and real-system measurements.