In production Wide-Area Networks (WANs), correlated failures dominate availability losses, forcing operators to reserve large safety margins that leave substantial capacity underutilized. Achieving high utilization under strict availability targets therefore requires risk-aware Traffic Engineering (TE) over dozens to hundreds of probabilistic failure scenarios, yet solving this problem at operational timescales remains elusive. We demonstrate that existing risk-aware formulations can be unified under an embedded Sort-and-Select structure, exposing a fundamental trade-off between expressiveness and tractability: classical optimizers either restrict scenario selection for efficiency or incur prohibitive decomposition costs. While deep learning appears promising, prior Deep TE methods mainly target maximum link utilization and rely on scaling-based feasibility, which breaks down under explicit capacity constraints and scenario-dependent risk. We present NeuroRisk, a physics-informed deep unrolled optimizer that exploits the Sort-and-Select structure. NeuroRisk enforces feasibility via gated edge-local reservations and represents scenario sets through permutation-invariant, gradient-aligned cues. Evaluations on production-style WANs show that NeuroRisk achieves small optimality gaps relative to a classical solver while running orders of magnitude faster ($10^2$ to $10^5\times$) on risk objectives, and that it outperforms neural baselines on nominal throughput.
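To make the Sort-and-Select idea concrete, here is a minimal sketch of one familiar risk measure that has this shape: a CVaR-style tail loss over discrete failure scenarios. The abstract does not define Sort-and-Select, so reading it as "sort scenarios by loss, then select the worst probability mass" is an assumption, and the name `tail_risk` and its parameters are hypothetical. It is not NeuroRisk's method.

```python
def tail_risk(losses, probs, beta):
    """Mean loss over the worst (1 - beta) probability mass (CVaR-style).

    losses[i] is the loss in failure scenario i (hypothetical input),
    probs[i] is that scenario's probability, and beta is the availability
    target, with 0 <= beta < 1.
    """
    tail = 1.0 - beta
    # Sort: order scenario indices from the worst loss to the best.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)

    remaining = tail
    total = 0.0
    # Select: consume scenarios in that order until the tail mass is used up.
    # The boundary scenario may contribute only part of its probability.
    for i in order:
        if remaining <= 0:
            break
        take = min(probs[i], remaining)
        total += take * losses[i]
        remaining -= take
    return total / tail


if __name__ == "__main__":
    # Four scenarios: one no-failure case and three failure cases.
    scenario_losses = [0.0, 0.1, 0.5, 0.2]
    scenario_probs = [0.7, 0.1, 0.1, 0.1]
    print(tail_risk(scenario_losses, scenario_probs, beta=0.8))
```

Which scenarios fall into the selected tail depends on the losses, which in turn depend on the traffic allocation. That coupling is one way to see why the selection step is costly for a classical optimizer to handle in full generality.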