AI coding agents increasingly modify real software repositories and make dependency decisions, including adding, removing, or updating third-party packages. These choices can materially affect security posture and maintenance burden, yet repository-level evaluations largely emphasize test passing and executability without explicitly scoring whether systems (i) reuse existing dependencies, (ii) avoid unnecessary additions, or (iii) select versions that satisfy security and policy constraints. We propose DepDec-Bench, a benchmark for evaluating dependency decision-making beyond functional correctness. To ground DepDec-Bench in real-world behavior, we conduct a preliminary study of 117,062 dependency changes from agent- and human-authored pull requests across seven ecosystems. We find that coding agents frequently make dependency decisions with security consequences that remain invisible to test-focused evaluation: agents select versions already known to be vulnerable at PR time (2.46%) and exhibit a net-negative security impact overall (net impact -98, vs. +1,316 for humans). These observations inform DepDec-Bench's task families and metrics, which evaluate safe version selection, reuse discipline, and restraint against dependency bloat alongside test passing.