Pull request (PR) descriptions generated by AI coding agents are the primary channel for communicating code changes to human reviewers. However, the alignment between these descriptions and the actual changes remains unexplored, raising concerns about the trustworthiness of AI agents. To fill this gap, we analyzed 23,247 agentic PRs across five agents for PR message-code inconsistency (PR-MCI). We contributed a dataset of 974 manually annotated PRs, found that 406 PRs (1.7%) exhibited high PR-MCI, and identified eight PR-MCI types, with "descriptions claim unimplemented changes" being the most common issue (45.4%). Statistical tests confirmed that high-MCI PRs had a 51.7-percentage-point lower acceptance rate (28.3% vs. 80.0%) and took 3.5 times as long to merge (55.8 vs. 16.0 hours). Our findings suggest that unreliable PR descriptions undermine trust in AI agents, highlighting the need for PR-MCI verification mechanisms and improved PR generation to enable trustworthy human-AI collaboration.