NEST: Network- and Memory-Aware Device Placement For Distributed Deep Learning

The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior work often relies on heuristic or topology-agnostic search and handles communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices. This increases synchronization, inflates communication, and underutilizes compute, which limits scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming (DP). NEST's DP operates on operator graphs with tensor- and expert-parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show that NEST achieves up to 2.43 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure. The source code of NEST is available at: https://github.com/scai-tech/Nest
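
To make the idea of a factored search space with a memory-feasibility filter concrete, the following is a minimal sketch and not NEST's algorithm. It enumerates every way to split a device count into tensor, pipeline, and data parallel degrees, then keeps only the splits whose per-device weight shard fits in device memory. The function names, the three-way factorization (the expert dimension is omitted for brevity), and the toy memory model (weights divided evenly by the tensor and pipeline degrees, replicated across data parallel ranks) are all assumptions made for illustration; network latency and the dynamic program itself are not modeled here.

```python
def divisors(n):
    """Return all positive divisors of n in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


def factorizations(num_devices):
    """Yield every ordered (tp, pp, dp) with tp * pp * dp == num_devices."""
    for tp in divisors(num_devices):
        for pp in divisors(num_devices // tp):
            dp = num_devices // (tp * pp)
            yield tp, pp, dp


def feasible_strategies(num_devices, model_bytes, device_bytes):
    """Keep strategies whose per-device weight shard fits in memory.

    Toy memory model (an assumption): tensor and pipeline parallelism
    split the weights evenly, data parallelism replicates them.
    """
    keep = []
    for tp, pp, dp in factorizations(num_devices):
        per_device = model_bytes / (tp * pp)
        if per_device <= device_bytes:
            keep.append((tp, pp, dp))
    return keep


if __name__ == "__main__":
    # Hypothetical numbers: 8 devices, 40 units of weights, 16 units per device.
    for strategy in feasible_strategies(8, 40, 16):
        print(strategy)
```

Filtering during enumeration, rather than repairing infeasible strategies afterwards by sharding more widely, is the contrast with post hoc approaches that the abstract draws. A fuller treatment would also score each surviving strategy by compute and communication cost.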
