The rapid expansion of global cloud wide-area networks (WANs) has made it challenging for commercial optimization engines to solve network traffic engineering (TE) problems efficiently at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal uses a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features that serve as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm that allocates each traffic demand independently while optimizing a central TE objective. Finally, Teal fine-tunes allocations with the Alternating Direction Method of Multipliers (ADMM), a highly parallelizable optimization algorithm, to reduce constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with more than 1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.
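To make the allocation problem concrete, the following is a minimal sketch of a path-based TE instance in plain Python. It is not Teal's GNN, RL, or ADMM code: the three-node topology, the capacities, the demand volumes, and the helper names (`link_loads`, `overutilized`, `total_allocated`) are all hypothetical and chosen only for illustration. The sketch shows what it means to allocate each demand across candidate paths with split ratios, and how a naive allocation can overutilize links, which is the kind of constraint violation the fine-tuning step is meant to reduce.

```python
from collections import defaultdict

# Hypothetical toy topology: link capacities keyed by (src, dst).
CAPACITY = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 5.0}

# Hypothetical demands: each has a volume and a list of candidate paths,
# where a path is a list of links.
DEMANDS = {
    "A->C": {
        "volume": 12.0,
        "paths": [[("A", "C")], [("A", "B"), ("B", "C")]],
    },
    "A->B": {
        "volume": 6.0,
        "paths": [[("A", "B")]],
    },
}


def link_loads(split_ratios):
    """Sum the traffic each link carries under per-demand split ratios."""
    loads = defaultdict(float)
    for name, demand in DEMANDS.items():
        for ratio, path in zip(split_ratios[name], demand["paths"]):
            for link in path:
                loads[link] += ratio * demand["volume"]
    return dict(loads)


def overutilized(loads, tol=1e-9):
    """Return the links whose load exceeds capacity, mapped to the excess."""
    return {
        link: load - CAPACITY[link]
        for link, load in loads.items()
        if load > CAPACITY[link] + tol
    }


def total_allocated(split_ratios):
    """Total traffic admitted: sum of (split ratios x volume) over demands."""
    return sum(
        sum(split_ratios[name]) * demand["volume"]
        for name, demand in DEMANDS.items()
    )


# A naive allocation that admits every demand in full but ignores capacity.
naive = {"A->C": [0.5, 0.5], "A->B": [1.0]}
# A capacity-respecting allocation that admits only part of demand A->C.
feasible = {"A->C": [5 / 12, 4 / 12], "A->B": [1.0]}

print(overutilized(link_loads(naive)))
print(overutilized(link_loads(feasible)), total_allocated(feasible))
```

In this toy setup, each demand's split ratios can be chosen independently, but the link-capacity constraints couple the demands together. That coupling is why a per-demand allocation may need a later correction step to remove overutilized links.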