Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

Large language models (LLMs) are increasingly used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. This shift changes how operational tasks are managed: agentic operations run as workflows that move from evidence gathering to action while respecting permissions, policies, and checks, with rollback available when needed. Such constraints matter because operational decisions can take effect immediately. To make the argument concrete, we organise the relevant literature around four dimensions: the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. An assurance contract defines what an agent may observe, propose, and execute, together with the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing: operational reliability comes less from the model itself than from the machinery around it. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation that covers trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain fragile in practice. Finally, we examine the security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps depends on treating autonomy as a constrained operational control problem whose outputs must be reliable, auditable, and securely deployable.
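
As an illustration of the assurance-contract idea described above, the following is a minimal Python sketch, not an implementation drawn from the surveyed work. It assumes a three-level autonomy hierarchy (observe, propose, execute), a tool allow-list, and named checks that must all pass before an action is authorised. Every name in it (`Scope`, `Action`, `AssuranceContract`, `rollback_required`, and the tool names) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum
from typing import Callable


class Scope(IntEnum):
    """Hypothetical autonomy levels, ordered from least to most privileged."""
    OBSERVE = 1
    PROPOSE = 2
    EXECUTE = 3


@dataclass(frozen=True)
class Action:
    tool: str                   # hypothetical tool name, e.g. "query_telemetry"
    scope: Scope                # privilege level the action needs
    has_rollback: bool = False  # whether a rollback plan is attached


# A check takes an action and returns True when the action passes.
Check = Callable[[Action], bool]


@dataclass
class AssuranceContract:
    max_scope: Scope
    allowed_tools: frozenset[str]
    checks: dict[str, Check] = field(default_factory=dict)

    def authorize(self, action: Action) -> tuple[bool, list[str]]:
        """Return (allowed, reasons); the reasons double as an audit trace."""
        reasons: list[str] = []
        if action.scope > self.max_scope:
            reasons.append(
                f"scope {action.scope.name} exceeds {self.max_scope.name}"
            )
        if action.tool not in self.allowed_tools:
            reasons.append(f"tool {action.tool!r} not in contract")
        for name, check in self.checks.items():
            if not check(action):
                reasons.append(f"check {name!r} failed")
        return (not reasons, reasons)


def rollback_required(action: Action) -> bool:
    """Illustrative check: anything that executes must carry a rollback plan."""
    return action.scope < Scope.EXECUTE or action.has_rollback


# Example contract (assumed values, for illustration only).
contract = AssuranceContract(
    max_scope=Scope.EXECUTE,
    allowed_tools=frozenset({"query_telemetry", "push_config"}),
    checks={"rollback_required": rollback_required},
)
```

Under this sketch, `contract.authorize(Action("push_config", Scope.EXECUTE))` is refused because no rollback plan is attached, while the same action with `has_rollback=True` is allowed. The refusal reasons are returned rather than raised so that they can be logged as part of an evidence trace.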
