A Quasi-Experimental Developer Study of Security Training in LLM-Assisted Web Application Development

This paper presents a controlled quasi-experimental developer study examining whether a layer-based security training package is associated with improved security quality in LLM-assisted implementation of an identity-centric Java Spring Boot backend. The study uses a mixed design with a within-subject comparison of pre-training and post-training runs and an exploratory between-subject expertise factor. Twelve developers completed matched runs under a common interface, a fixed model configuration, counterbalanced task sets, and a shared starter project. Security outcomes were assessed through manual validation of the submitted repositories, performed independently by the first and second authors. The primary participant-level endpoint was a severity-weighted validated-weakness score, which showed a significant paired reduction in the post-training condition under an exact Wilcoxon signed-rank test ($p = 0.0059$). In aggregate, validated weaknesses decreased from 162 to 111 (31.5\%), the severity-weighted burden decreased from 432 to 267 (38.2\%), and critical findings decreased from 24 to 5 (79.2\%). The largest reductions were in authorization and object access weaknesses (53.3\%) and in authentication, credential policy, and recovery weaknesses (44.7\%). Session and browser trust-boundary issues showed minimal change, and sensitive-data and cryptographic weaknesses showed only marginal improvement. These results suggest that, under the tested conditions, the training is associated with a lower validated security burden in LLM-assisted backend development without any change to the model. They do not support replacing secure defaults, static analysis, expert review, or operational hardening.
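As a reading aid, the aggregate percentages above can be recomputed from the reported counts, and the kind of test named in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' analysis code: the helper names `relative_reduction` and `exact_wilcoxon_two_sided` are hypothetical, the two-sided form and the tie handling (zero differences dropped, average ranks for ties) are assumptions, and the reported $p = 0.0059$ cannot be reproduced here because the participant-level scores are not given in this section.

```python
from itertools import product


def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`, rounded to one decimal."""
    return round(100.0 * (before - after) / before, 1)


# Aggregate (pre-training, post-training) counts as reported in the abstract.
reported = {
    "validated weaknesses": (162, 111),
    "severity-weighted burden": (432, 267),
    "critical findings": (24, 5),
}
reductions = {name: relative_reduction(b, a) for name, (b, a) in reported.items()}
# reductions == {"validated weaknesses": 31.5,
#                "severity-weighted burden": 38.2,
#                "critical findings": 79.2}


def exact_wilcoxon_two_sided(pre, post):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.

    Assumptions of this sketch: zero differences are dropped, tied absolute
    differences receive average ranks, and the null distribution is obtained
    by enumerating every sign pattern (2**n, i.e. 4096 for n = 12).
    """
    diffs = [a - b for a, b in zip(pre, post) if a != b]
    n = len(diffs)
    abs_sorted = sorted(abs(d) for d in diffs)

    def average_rank(value):
        first = abs_sorted.index(value) + 1   # 1-based rank of first occurrence
        count = abs_sorted.count(value)       # size of the tie group
        return first + (count - 1) / 2

    ranks = [average_rank(abs(d)) for d in diffs]
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)

    # Count sign patterns whose statistic is at least as extreme as observed.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return hits / 2 ** n
```

With twelve pairs the exact null distribution has only 4096 equally likely sign patterns, so full enumeration is cheap and avoids the normal approximation, which is unreliable at this sample size.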
