Large language models (LLMs) are rapidly transforming how software is created and maintained. Comparing LLM-generated code against human-written code is essential to determine whether these new tools uphold or erode the security baselines established by professional developers. Yet we lack a standardized method for empirically comparing the security of code produced through human-LLM collaboration against code produced by LLMs alone or by humans alone. To address this gap, we propose an automated framework for conducting comparative studies across human-only, LLM-only, and hybrid conditions. Our approach automates the logging of prompts, timing, and experimental settings, and measures outcomes through multi-dimensional static and dynamic quality analysis. We validate the framework via a feasibility study, which provides an experimental blueprint for ``species-fair'' comparisons between human and AI subjects. We provide an open-source implementation of the framework so that future researchers can conduct reproducible, species-fair experiments. By sharing lessons learned, we establish a foundation for empirical research on the security of human-written and LLM-generated code.