Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce \textsc{VIBench}, a benchmark for measuring VIB in direct and agentic code generation across $20$ provider-selectable software-integration scenarios. Evaluating $10$ frontier provider-affiliated models against $3$ non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to $+18.8$ percentage points (pp). Agentic workflows further amplify VIB, reaching $+39.2$ pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as $90.3\%$. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.