Network configurations are prone to errors, which can lead to catastrophic service outages. Operators therefore highly desire a tool for automatic configuration repair (ACR). Existing ACR tools follow a \textit{semantics-driven approach}: they model network semantics as a set of SMT constraints and solve those constraints to locate or fix the error. Because network semantics are complex, constructing and solving these constraints can be prohibitively expensive, which makes these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction, a \textit{syntax-driven approach}, which generates and validates syntactically valid candidate updates without modeling program semantics, often drawing on existing code in the same repository. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It searches for repairs through multiple iterations of a "localize-fix-validate" pipeline, and it proves quite effective on configurations of our production network. Specifically, with 15 types of errors injected, Astragalus repairs every incident on synthesized networks of multiple sizes and 97.5% of the incidents on a real network, within an average time of 6.93 seconds. It has also provided valid repairs in under 6 minutes for 7 recent network incidents or undesired changes in a real production network with O(1,000) to O(10,000) devices.
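As a rough illustration of what an iterated "localize-fix-validate" search can look like, the following minimal Python sketch works on a toy model in which a configuration is a mapping from device names to lines of text. It is not the Astragalus implementation: the names `localize`, `candidate_fixes`, `repair`, the `failures` callback (which counts violated checks), and the `suspicious` scoring callback are all hypothetical, and the greedy acceptance rule (keep any candidate that reduces the number of failures) is an assumption made for brevity.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Config = Dict[str, List[str]]  # toy model: device name -> configuration lines


def localize(config: Config,
             suspicious: Callable[[str, str], float]) -> List[Tuple[str, int]]:
    """Rank (device, line index) pairs by suspiciousness, highest first."""
    scored = [(suspicious(dev, line), dev, i)
              for dev, lines in config.items()
              for i, line in enumerate(lines)]
    scored.sort(key=lambda t: -t[0])  # stable sort keeps device order on ties
    return [(dev, i) for score, dev, i in scored if score > 0]


def candidate_fixes(config: Config, dev: str, idx: int) -> Iterator[Config]:
    """Yield syntactic edits: delete the line, or swap in a line that already
    exists somewhere in the configuration set (no semantic model involved)."""
    lines = config[dev]
    yield {**config, dev: lines[:idx] + lines[idx + 1:]}
    donors = {l for ls in config.values() for l in ls if l != lines[idx]}
    for donor in sorted(donors):
        yield {**config, dev: lines[:idx] + [donor] + lines[idx + 1:]}


def repair(config: Config,
           failures: Callable[[Config], int],
           suspicious: Callable[[str, str], float],
           max_iters: int = 5) -> Optional[Config]:
    """Greedy search: each iteration accepts the first candidate that
    strictly reduces the number of failed checks."""
    current = config
    for _ in range(max_iters):
        remaining = failures(current)
        if remaining == 0:
            return current
        improved = False
        for dev, idx in localize(current, suspicious):
            for cand in candidate_fixes(current, dev, idx):
                if failures(cand) < remaining:  # validate step
                    current, improved = cand, True
                    break
            if improved:
                break
        if not improved:
            return None  # no candidate helped; give up
    return current if failures(current) == 0 else None


# Toy usage: every device is required to carry the permit line.
PERMIT = "permit 10.0.0.0/8"
broken = {
    "r1": ["hostname r1", PERMIT],
    "r2": ["hostname r2", "deny 10.0.0.0/8"],
    "r3": ["hostname r3", "deny 10.0.0.0/8"],
}
count_failures = lambda cfg: sum(PERMIT not in lines for lines in cfg.values())
flag_deny = lambda dev, line: 1.0 if line.startswith("deny") else 0.0
fixed = repair(broken, count_failures, flag_deny)
```

In this toy run the two faulty devices are repaired in two successive iterations, each time by borrowing the permit line that already exists on `r1`. A real validator would check network-level intents rather than count lines, and a real search would likely rank and prune candidates far more carefully.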