How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks

Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M parameters (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1B to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Δ_5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit of edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5B to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at https://github.com/maferrag/6G-Bench.
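
The figures quoted above can be checked with a few lines of arithmetic, and the idea behind normalizing accuracy by resource cost can be sketched in the same place. The following is a minimal Python sketch, not the authors' implementation. The abstract does not give the Edge Score formula, so `edge_score` below assumes the simplest possible form, accuracy divided by the product of latency and memory. The function name, its parameters, and the latency and memory values in the last two lines are hypothetical and are not measurements from the study. The 3B accuracy is not quoted in the abstract; it is only implied by subtracting the reported +0.064 gain from the 7B score.

```python
# Figures quoted in the abstract (pass@1 accuracy and instability gap Delta_5).
PASS_AT_1 = {
    "SmolLM2-135M": 0.224,
    "Llama-3.2-1B": 0.373,
    "Qwen2.5-1.5B": 0.531,
    "Qwen2.5-7B": 0.707,
}
DELTA_5 = {"Llama-3.2-1B": 0.356, "Qwen2.5-1.5B": 0.138}
GAIN_3B_TO_7B = 0.064  # reported improvement from 3B to 7B

# Accuracy gain across the 1B -> 1.5B stability transition.
transition_gain = round(PASS_AT_1["Qwen2.5-1.5B"] - PASS_AT_1["Llama-3.2-1B"], 3)

# Contraction of the instability gap over the same transition.
delta5_contraction = round(DELTA_5["Llama-3.2-1B"] - DELTA_5["Qwen2.5-1.5B"], 3)

# 3B accuracy implied by the reported gain (derived here, not quoted in the abstract).
implied_3b_accuracy = round(PASS_AT_1["Qwen2.5-7B"] - GAIN_3B_TO_7B, 3)


def edge_score(accuracy: float, latency_s: float, memory_gb: float) -> float:
    """ASSUMED form of an accuracy-per-resource score: accuracy / (latency * memory).

    The paper's actual Edge Score definition may differ; this only illustrates
    why such a score need not grow with parameter count.
    """
    if latency_s <= 0 or memory_gb <= 0:
        raise ValueError("latency and memory must be positive")
    return accuracy / (latency_s * memory_gb)


# HYPOTHETICAL inputs (not from the paper): a model that is twice as accurate
# but costs 4x the latency and 4x the memory ends up with a lower score.
small_score = edge_score(accuracy=0.35, latency_s=1.0, memory_gb=1.0)
large_score = edge_score(accuracy=0.70, latency_s=4.0, memory_gb=4.0)

print(transition_gain, delta5_contraction, implied_3b_accuracy)
print(small_score > large_score)
```

Under this assumed form, the score rewards accuracy only insofar as it outpaces the growth in latency and memory, which is consistent with the claim that reliability per unit of edge resource is not monotonic in model size.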
