Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation

Large language models (LLMs) are increasingly used to automate power-system analysis, but many utilities and energy-research labs require on-premise serving for confidentiality, regulatory, reproducibility, and cost reasons. This makes the reliability of open-weight models a deployment issue. We show that first-pass failures in power-system code generation are dominated not by reasoning errors alone but by structured API-knowledge boundary errors: hallucinated function names, misused parameters, and mishandled result tables in versioned simulation libraries. We introduce three components: PowerCodeBench, an execution-validated benchmark generator that pairs natural-language operator queries with pandapower code and numerical ground truth; an L0-L3 documentation-driven probing procedure that measures per-model API knowledge profiles; and a boundary-aware intervention that combines query-side API demand estimation with targeted proactive documentation injection and routed reactive correction. On a frozen 2,000-task release, we evaluate ten open-weight LLMs (1.5B-480B parameters) and four commercial mid-tier APIs. The intervention improves every evaluated open-weight model of at least 7B parameters and every commercial API by 32 to 56 accuracy points. Open-weight models in the 70B-120B range match the accuracy range of the commercial mid-tier APIs, while Llama-3.1-405B and Qwen3-Coder-480B lead the panel. The targeted prompts preserve the full-context accuracy ceiling at 41% of the full-context prompt-token cost. The result is a deployment-time, accuracy-focused path toward reliable on-premise LLM assistance for grid-analysis workflows without fine-tuning or cloud inference.
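As a rough illustration of the three error classes named above, the sketch below executes a generated snippet against a stand-in library and maps the resulting Python exception to a category. Everything here is an assumption made for illustration: `fake_pp`, `classify_failure`, and the category labels are hypothetical, the stand-in only loosely mimics the shape of pandapower's `runpp` call and `res_bus` result table, and this is not the benchmark's actual validation pipeline.

```python
from types import SimpleNamespace


def _runpp(net, algorithm="nr"):
    """Hypothetical stand-in solver: writes a tiny bus-result table into net."""
    net["res_bus"] = {"vm_pu": [1.0, 0.98]}


# Stand-in module object (assumption): exposes a single function, runpp.
fake_pp = SimpleNamespace(runpp=_runpp)


def classify_failure(snippet: str) -> str:
    """Run a generated snippet and map its exception to an error category."""
    net = {}
    try:
        exec(snippet, {"pp": fake_pp, "net": net})
    except AttributeError:
        # The snippet called a function the library does not expose.
        return "hallucinated_function"
    except TypeError:
        # The snippet passed an argument the function does not accept.
        return "misused_parameter"
    except KeyError:
        # The snippet read a result column that does not exist.
        return "mishandled_result_table"
    return "ok"


if __name__ == "__main__":
    snippets = [
        "pp.run_powerflow(net)",
        "pp.runpp(net, solver='nr')",
        "pp.runpp(net); v = net['res_bus']['voltage']",
        "pp.runpp(net); v = net['res_bus']['vm_pu']",
    ]
    for s in snippets:
        print(f"{classify_failure(s):26s} {s}")
```

A real harness would need sandboxed execution and a comparison against numerical ground truth; the exception-to-category mapping here is deliberately coarse and only distinguishes the three error types listed in the abstract.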
