Training large language models (LLMs), and other large machine learning models, involves repeated communication of large volumes of data across a data center network. The communication patterns induced by these training processes exhibit high regularity and persistence, giving rise to significant opportunities for optimizing how flows are routed across the network. We present an algorithmic framework for \textit{quantifying} network-wide efficiency in the context of training LLMs (and other large-scale ML models), and for periodically \textit{optimizing} routing with respect to this global metric.
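To make the quantify-then-optimize loop concrete, the following is a minimal sketch in Python. The specific global metric (maximum link utilization) and the greedy per-flow path re-assignment are illustrative assumptions for exposition, not the framework's actual metric or algorithm; \texttt{networkx} is used only as a convenient graph library.

\begin{verbatim}
# Sketch: quantify a network-wide metric, then periodically re-route.
# Metric and re-routing strategy here are assumptions, not the paper's.
import itertools
import networkx as nx

def max_link_utilization(graph, flows, routing):
    """Assumed global efficiency metric: utilization of the most
    loaded link, given each persistent flow's current path."""
    load = {e: 0.0 for e in graph.edges}
    for flow_id, demand in flows.items():
        path = routing[flow_id]
        for u, v in zip(path, path[1:]):
            edge = (u, v) if (u, v) in load else (v, u)
            load[edge] += demand
    return max(load[e] / graph.edges[e]["capacity"] for e in load)

def reoptimize(graph, flows, endpoints, routing, k=4):
    """One periodic optimization round: for each flow, keep whichever
    of its k shortest paths minimizes the global metric."""
    for flow_id, (src, dst) in endpoints.items():
        candidates = itertools.islice(
            nx.shortest_simple_paths(graph, src, dst), k)
        routing[flow_id] = min(
            candidates,
            key=lambda p: max_link_utilization(
                graph, flows, {**routing, flow_id: p}))
    return routing

# Example: a 4-node ring carrying two persistent training flows.
G = nx.cycle_graph(4)
nx.set_edge_attributes(G, 10.0, "capacity")
flows = {"f1": 6.0, "f2": 6.0}
endpoints = {"f1": (0, 2), "f2": (1, 3)}
routing = {f: nx.shortest_path(G, *endpoints[f]) for f in flows}
routing = reoptimize(G, flows, endpoints, routing)
print(max_link_utilization(G, flows, routing))
\end{verbatim}

Because training traffic is regular and persistent, such a round can amortize its cost over many iterations of the training loop before the pattern changes.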