Network configurations are prone to errors, which can lead to catastrophic service outages. Operators therefore strongly desire a tool for automatic configuration repair (ACR). Existing ACR tools follow a semantic-driven approach: they model network semantics as a set of SMT constraints and solve these constraints to localize or fix the error. Because network semantics are complex, constructing and solving these constraints can be prohibitively expensive, which makes these tools neither general nor scalable. Inspired by automatic program repair (APR), we explore another direction: a syntax-driven approach, which in APR repairs program bugs by ``grafting'' existing code from the same repository, without modeling program semantics. Following this direction, we propose Astragalus, a syntax-driven method for ACR. It searches for repairs through multiple iterations of a ``localize-fix-validate'' pipeline, and proves quite effective on configurations of our production network. Specifically, we show that Astragalus can repair every incident in synthesized networks of multiple sizes and 97.5\% of the incidents on a real network, both with 15 types of errors injected, within an average time of 7.36 seconds. It has also provided valid repair options in under 6 minutes for 4 recent network incidents or undesired changes in a real production network with O(1,000) to O(10,000) devices.