Correct Code, Vulnerable Dependencies: A Large-Scale Measurement Study of LLM-Specified Library Versions

Large language models (LLMs) are now deeply involved in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. When prompted directly, LLMs specify version identifiers in 26.83%-95.18% of cases; when asked to create a manifest file directly, the rate drops to 6.45%-59.19%. Among the specified versions, 36.70%-55.70% of tasks involve at least one known CVE, and 62.75%-74.51% of these carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. All models converge on the same small set of risky release versions, which indicates a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. Dynamic test cases confirm the pattern, with pass rates of 6.49%-48.62%. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the communities of the evaluated models, and several confirmed the issue. All code and data have been released for open science at https://github.com/dw763j/PinTrace.
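To make the notion of version-level risk concrete, the following is a minimal sketch of how exact version pins in a manifest could be checked against an advisory table. It is not the PinTrace pipeline: the function names `parse_pins` and `flag_pins`, the package names, and the advisory identifier are all hypothetical, and a real audit would query an actual vulnerability database rather than a hand-written dictionary.

```python
import re

# Matches exact pins such as "examplelib==1.2.0"; other specifiers are ignored.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*([A-Za-z0-9.!+*_-]+)")


def parse_pins(manifest_text):
    """Return {package_name: version} for every exact '==' pin in the text."""
    pins = {}
    for line in manifest_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = PIN_RE.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def flag_pins(pins, advisories):
    """Return {package: [advisory ids]} for pins listed in the advisory table."""
    flagged = {}
    for name, version in pins.items():
        ids = advisories.get((name, version), [])
        if ids:
            flagged[name] = list(ids)
    return flagged


# Hypothetical data for illustration only; not taken from the study.
EXAMPLE_MANIFEST = """
examplelib==1.2.0   # pinned by the model
otherlib==3.4.1
unpinnedlib
"""
EXAMPLE_ADVISORIES = {("examplelib", "1.2.0"): ["EXAMPLE-ADVISORY-1"]}

if __name__ == "__main__":
    pins = parse_pins(EXAMPLE_MANIFEST)
    print(flag_pins(pins, EXAMPLE_ADVISORIES))
```

The sketch only recognizes exact `==` pins, so unpinned entries and range specifiers are skipped rather than evaluated. That is a simplifying assumption, chosen because the risk discussed here attaches to a specific pinned release.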
