AI-based coding agents are increasingly integrated into software development workflows, collaborating with developers to create pull requests (PRs). Despite their growing adoption, the role of human-agent collaboration in software testing remains poorly understood. This paper presents an empirical study of 6,582 human-agent PRs (HAPRs) and 3,122 human PRs (HPRs) from the AIDev dataset. We compare HAPRs and HPRs along three dimensions: (i) testing frequency and extent, (ii) types of testing-related changes (code-and-test co-evolution vs. test-focused), and (iii) testing quality, measured by test smells. Our findings reveal that, although the likelihood of including tests is comparable (42.9% for HAPRs vs. 40.0% for HPRs), HAPRs exhibit a larger extent of testing, nearly doubling the test-to-source line ratio found in HPRs. While test-focused task distributions are comparable, HAPRs are more likely to add new tests during co-evolution (OR=1.79), whereas HPRs prioritize modifying existing tests. Finally, although some test smell categories differ statistically, negligible effect sizes suggest no meaningful differences in quality. These insights provide the first characterization of how human-agent collaboration shapes testing practices.