The rapid expansion of global cloud wide-area networks (WANs) has made it challenging for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance. We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal uses a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm, to reduce constraint violations such as overutilized links. We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625x speedups.
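To make the abstract's notions of "allocation" and "constraint violations such as overutilized links" concrete, the following is a minimal sketch of a path-based TE bookkeeping step. It assumes each demand has a fixed set of candidate paths and that an allocation is a list of split ratios over those paths; the abstract does not state this formulation, so treat it as an illustrative assumption. The topology, capacities, demands, and the function names `link_loads` and `overutilized_links` are all hypothetical. The sketch only measures violations; it is not Teal's GNN, RL, or ADMM component.

```python
from collections import defaultdict


def link_loads(demands, paths, split_ratios):
    """Sum the traffic each link carries under the given split ratios.

    demands:      {flow: volume}
    paths:        {flow: [path, ...]}, each path a list of (src, dst) links
    split_ratios: {flow: [ratio, ...]}, one ratio per candidate path
    """
    loads = defaultdict(float)
    for flow, volume in demands.items():
        for path, ratio in zip(paths[flow], split_ratios[flow]):
            for link in path:
                loads[link] += volume * ratio
    return dict(loads)


def overutilized_links(loads, capacities):
    """Return {link: excess traffic} for every link loaded above capacity."""
    return {
        link: load - capacities[link]
        for link, load in loads.items()
        if load > capacities[link]
    }


# Hypothetical three-node topology (illustrative values only).
capacities = {("A", "B"): 10.0, ("B", "C"): 10.0, ("A", "C"): 5.0}

demands = {("A", "C"): 12.0, ("A", "B"): 6.0}
paths = {
    ("A", "C"): [[("A", "C")], [("A", "B"), ("B", "C")]],  # direct and via B
    ("A", "B"): [[("A", "B")]],
}
# An allocation that splits the A->C demand evenly across its two paths.
split_ratios = {("A", "C"): [0.5, 0.5], ("A", "B"): [1.0]}

loads = link_loads(demands, paths, split_ratios)
excess = overutilized_links(loads, capacities)
print(loads)
print(excess)
```

In this toy instance the even split overloads both the direct A-C link and the shared A-B link. Violations of this kind are what the abstract describes a fine-tuning step as reducing.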