As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (a generic agent with tool access but no domain guidance) and with-skill (the same agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Across 5 open-weight models and 185 scenario-runs, results show a consistent skill lift for every model. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
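The abstract describes deterministic rubrics that combine three check types: response content checks, tool-call verification, and database state assertions. A minimal sketch of how such a rubric might be scored is shown below; all names (`Rubric`, its fields, and the scenario data) are hypothetical illustrations, not the benchmark's actual API.

```python
from dataclasses import dataclass

@dataclass
class Rubric:
    """One scenario's deterministic pass/fail criteria (illustrative only)."""
    required_phrases: list      # substrings the agent's final answer must contain
    expected_tool_calls: list   # (tool_name, required_args_subset) pairs
    expected_db_state: dict     # key -> value assertions against the mock-server DB

    def score(self, response: str, tool_calls: list, db: dict) -> bool:
        # 1. Response content check: every required phrase appears in the answer.
        content_ok = all(p.lower() in response.lower() for p in self.required_phrases)
        # 2. Tool-call verification: each expected call was made with (at least)
        #    the required arguments; dict-items views support subset comparison.
        calls_ok = all(
            any(name == c["name"] and args.items() <= c["args"].items()
                for c in tool_calls)
            for name, args in self.expected_tool_calls
        )
        # 3. Database state assertion: the mock API's store ends in the expected state.
        state_ok = all(db.get(k) == v for k, v in self.expected_db_state.items())
        return content_ok and calls_ok and state_ok
```

A run passes only if all three checks hold, which keeps scoring reproducible without any LLM-as-judge component.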