Large language models (LLMs) are increasingly integrated into software development workflows, yet they often introduce subtle logic or data-misuse errors that differ from typical human bugs. To study how these two error types interact, we construct Tricky$^2$, a hybrid dataset that augments the existing TrickyBugs corpus of human-written defects with errors injected by both GPT-5 and OpenAI-oss-20b across C++, Python, and Java programs. Our approach uses a taxonomy-guided prompting framework to generate machine-originated bugs while preserving the original human defects and program structure. The resulting corpus spans human-only, LLM-only, and human+LLM splits, enabling analysis of mixed-origin error behavior, multi-bug repair robustness, and reliability in hybrid human-machine code. This paper outlines the dataset construction pipeline and illustrates its use through small-scale baseline evaluations of bug classification, fault localization, and program repair tasks.
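The taxonomy-guided injection step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the taxonomy categories, function names, and prompt wording here are hypothetical placeholders.

```python
# Hypothetical sketch of taxonomy-guided bug-injection prompting.
# The category names and prompt text are illustrative assumptions,
# not the actual Tricky^2 pipeline.
BUG_TAXONOMY = {
    "off_by_one": "Shift one loop bound or index by one.",
    "wrong_operator": "Swap a comparison or arithmetic operator.",
    "data_misuse": "Use a semantically wrong but type-compatible variable.",
}

def build_injection_prompt(source_code: str, category: str, language: str) -> str:
    """Compose a prompt asking an LLM to inject exactly one bug of the
    given taxonomy category while leaving any existing (human-written)
    defects and the overall program structure untouched."""
    instruction = BUG_TAXONOMY[category]
    return (
        f"You are given a {language} program that may already contain bugs.\n"
        f"Inject exactly one new bug of type '{category}': {instruction}\n"
        "Do not repair or alter any existing defect, and preserve the "
        "program's structure.\n"
        f"Program:\n```{language}\n{source_code}\n```"
    )

# Example: request an off-by-one injection into a small Python snippet.
prompt = build_injection_prompt(
    "for i in range(n):\n    total += a[i]", "off_by_one", "python"
)
```

The per-category instruction keeps the model's edit narrowly scoped, which is what lets the human-only and human+LLM splits stay comparable program-for-program.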