The integration of Large Language Models (LLMs) into wireless networks presents significant potential for automating system design. However, unlike conventional throughput maximization, Covert Communication (CC) requires optimizing transmission utility under strict detection-theoretic constraints, such as Kullback-Leibler (KL) divergence limits. Existing benchmarks primarily focus on general reasoning or standard communication tasks and do not adequately evaluate the ability of LLMs to satisfy these rigorous security constraints. To address this limitation, we introduce CovertComBench, a unified benchmark designed to assess LLM capabilities across the CC pipeline, encompassing conceptual understanding (MCQs), optimization derivation (ODQs), and code generation (CGQs). Furthermore, we analyze the reliability of automated scoring within a detection-theoretic ``LLM-as-Judge'' framework. Extensive evaluations of state-of-the-art models reveal a pronounced performance gap: while LLMs achieve high accuracy in conceptual identification (81%) and code implementation (83%), their accuracy on the higher-order mathematical derivations required for security guarantees ranges from only 18% to 55%. This gap indicates that current LLMs are better suited as implementation assistants than as autonomous solvers for security-constrained optimization. These findings suggest that future research should focus on external tool augmentation to build trustworthy wireless AI systems.
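To make the detection-theoretic constraint referenced above concrete, a minimal sketch of the canonical covertness-constrained formulation is shown below; the utility function $R(\cdot)$, transmit-power variable $p$, and covertness budget $\epsilon$ are illustrative placeholders rather than the benchmark's exact notation.
\[
\max_{p \ge 0} \; R(p) \quad \text{s.t.} \quad D\!\left(\mathbb{P}_1 \,\|\, \mathbb{P}_0\right) \le 2\epsilon^2 ,
\]
where $\mathbb{P}_1$ and $\mathbb{P}_0$ denote the warden's observation distributions with and without transmission. Under this standard criterion, bounding the KL divergence by $2\epsilon^2$ keeps the warden's sum of false-alarm and missed-detection probabilities at or above $1-\epsilon$, which is the security guarantee the ODQ tasks ask models to reason about.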