Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

The past decade has witnessed a rapid expansion of global cloud wide-area networks (WANs) with the deployment of new network sites and datacenters, making it challenging for commercial optimization engines to solve the network traffic engineering (TE) problem quickly at scale. Current approaches to accelerating TE optimization decompose the task into subproblems that can be solved in parallel using optimization solvers, but they are fundamentally restricted to a few dozen subproblems in order to balance run time and TE performance, achieving limited parallelism and speedup. Motivated by the ability to readily access thousands of threads on GPUs through modern deep learning frameworks, we propose Teal, a learning-based TE algorithm that harnesses the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and model network flows, learning flow features as inputs to the downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to allocate each traffic demand independently toward optimizing a central TE objective. Finally, Teal fine-tunes the resulting flow allocations using the alternating direction method of multipliers (ADMM), a highly parallelizable constrained optimization algorithm, to reduce constraint violations (e.g., overused links). We evaluate Teal on traffic matrices collected from a global cloud provider, and show that on a large WAN topology with over 1,700 nodes, Teal generates near-optimal flow allocations while being several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies up to 29% more traffic demands and yields up to 109x speedups.
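To make the allocation step and the notion of a constraint violation concrete, the following is a minimal sketch and not Teal's implementation. It assumes each demand has a fixed set of candidate paths, and that some policy has already produced one score per path; in Teal those scores would come from the learned model, while here they are plain inputs. Every function name and the toy data layout are hypothetical. Each demand is split across its own paths independently via a softmax, after which link loads are summed to find overused links, the kind of violation that a fine-tuning step such as ADMM is meant to reduce.

```python
import math


def softmax(scores):
    """Turn per-path scores into split ratios that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def allocate(demands, path_scores):
    """Split each demand across its candidate paths, independently per demand.

    demands:     list of demand volumes, one per demand
    path_scores: list (per demand) of scores, one per candidate path
                 (assumed to come from some policy; hypothetical input)
    """
    return [
        [volume * ratio for ratio in softmax(scores)]
        for volume, scores in zip(demands, path_scores)
    ]


def link_loads(allocation, paths, num_links):
    """Sum the flow placed on every link.

    paths: list (per demand) of candidate paths, each a list of link indices
    """
    loads = [0.0] * num_links
    for demand_paths, demand_flows in zip(paths, allocation):
        for path, flow in zip(demand_paths, demand_flows):
            for link in path:
                loads[link] += flow
    return loads


def overused_links(loads, capacities, tol=1e-9):
    """Return indices of links whose load exceeds capacity."""
    return [i for i, (load, cap) in enumerate(zip(loads, capacities))
            if load > cap + tol]
```

Because each demand's split depends only on its own scores, the per-demand computation is trivially parallel. This sketch only detects overused links; it does not perform the ADMM fine-tuning itself.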
