Cellular networks such as 4G and 5G rely on complex technical specifications to ensure correct functionality; however, these specifications often contain flaws or ambiguities. In this paper, we investigate the application of Large Language Models (LLMs) to automated cellular network specification refinement. We identify Change Requests, which record specification revisions, as a key source of domain-specific data, and formulate specification refinement as three complementary sub-tasks. We introduce CR-Eval, a benchmark of 200 security-related test cases, and evaluate 17 open-source and 14 proprietary models on it. The best-performing model, GPT-o3-mini, identifies weaknesses in more than 127 of the test cases within five trials. We further study LLM specialization, showing that a fine-tuned 8B model can outperform advanced LLMs such as DeepSeek-R1 and Qwen3-235B. Evaluations on 30 real-world cellular attacks demonstrate both the practical impact and the remaining challenges. The codebase and benchmark are available at https://github.com/jianshuod/CR-Eval.