Pull request (PR) descriptions generated by AI coding agents are the primary channel for communicating code changes to human reviewers. However, how well these descriptions align with the actual changes remains unexplored, raising concerns about the trustworthiness of AI agents. To fill this gap, we analyzed 23,247 agentic PRs produced by five agents through the lens of PR message-code inconsistency (PR-MCI). We contributed a dataset of 974 manually annotated PRs, found that 406 PRs (1.7%) exhibited high PR-MCI, and identified eight PR-MCI types, revealing that descriptions claiming changes that were never implemented were the most common issue (45.4%). Statistical tests confirmed that high-MCI PRs had acceptance rates 51.7 percentage points lower (28.3% vs. 80.0%) and took 3.5x longer to merge (55.8 vs. 16.0 hours). Our findings suggest that unreliable PR descriptions undermine trust in AI agents, highlighting the need for PR-MCI verification mechanisms and improved PR generation to enable trustworthy human-AI collaboration.
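The acceptance-rate gap can be checked with a standard test of independence on a 2x2 contingency table. The abstract does not name the specific statistical test used, and it reports rates rather than raw counts, so the sketch below uses a Pearson chi-square test on illustrative counts reconstructed from the reported percentages (406 high-MCI PRs at 28.3% acceptance, 568 remaining annotated PRs at 80.0%); these reconstructed counts are an assumption, not the paper's data.

```python
import math

# Illustrative 2x2 table reconstructed from the reported rates
# (assumption: 406 high-MCI PRs at 28.3% acceptance,
#  568 low-MCI annotated PRs at 80.0% acceptance).
high_acc = round(406 * 0.283)   # accepted high-MCI PRs
high_rej = 406 - high_acc       # rejected high-MCI PRs
low_acc = round(568 * 0.800)    # accepted low-MCI PRs
low_rej = 568 - low_acc         # rejected low-MCI PRs

# Pearson chi-square statistic for a 2x2 table.
a, b, c, d = high_acc, high_rej, low_acc, low_rej
n = a + b + c + d
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# p-value for 1 degree of freedom via the complementary error function:
# P(X >= chi2) = erfc(sqrt(chi2 / 2)) for a chi-square with df = 1.
p = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.1f}, p = {p:.3g}")
```

Under these assumed counts the gap of roughly 52 percentage points yields a very large statistic and a p-value far below any conventional threshold, consistent with the abstract's claim that the difference is statistically significant.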