Today's pursuit of a single Large Language Model (LLM) for all software engineering tasks is resource-intensive and overlooks the potential benefits of complementarity, where different models contribute unique strengths. However, the degree to which coding LLMs complement each other and the best strategy for maximizing an ensemble's potential remain unclear, leaving practitioners without a clear path beyond single-model systems. To address this gap, we empirically compare ten individual LLMs from five families, and three ensembles of these LLMs, across three software engineering benchmarks covering code generation and program repair. We assess the complementarity between models and the performance gap between the best individual model and the ensembles. We then evaluate various selection heuristics for identifying correct solutions in an ensemble's candidate pool. We find that the theoretical upper bound of an ensemble's performance can be 83% above the best single model. Our results show that consensus-based selection strategies fall into a "popularity trap," amplifying common but incorrect outputs. In contrast, a diversity-based strategy realizes up to 95% of this theoretical potential and proves effective even in small two-model ensembles, enabling a cost-efficient way to enhance performance by leveraging multiple LLMs.
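To make the contrast between the two selection families concrete, the sketch below illustrates one plausible reading of each: a consensus pick that returns the most frequent candidate (and thus risks the popularity trap), and a greedy diversity pick that retains maximally dissimilar candidates. The function names, the Jaccard token-distance metric, and the greedy max-min procedure are illustrative assumptions, not the paper's published heuristics.

```python
from collections import Counter


def consensus_pick(candidates: list[str]) -> str:
    """Majority vote: return the most frequent candidate.

    Illustrates the "popularity trap": a common but wrong answer
    wins whenever several models make the same mistake.
    """
    return Counter(candidates).most_common(1)[0][0]


def token_distance(a: str, b: str) -> float:
    """Jaccard distance over whitespace tokens (assumed metric)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def diversity_pick(candidates: list[str], k: int = 3) -> list[str]:
    """Greedily keep up to k maximally dissimilar candidates,
    so distinct solution strategies survive into the final pool
    instead of being drowned out by near-duplicates.
    """
    pool = [candidates[0]]
    while len(pool) < min(k, len(candidates)):
        remaining = [c for c in candidates if c not in pool]
        if not remaining:
            break
        # Pick the candidate farthest (in min-distance terms) from the pool.
        best = max(
            remaining,
            key=lambda c: min(token_distance(c, p) for p in pool),
        )
        pool.append(best)
    return pool
```

In this reading, the diverse pool would still need a downstream check (e.g., running the benchmark's tests) to pick the final answer; the point of the sketch is only that diversity-aware selection preserves minority-but-correct candidates that a majority vote discards.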