It is a common belief that large language models (LLMs) are better than smaller ones. However, larger models also require significantly more time and compute during inference, which raises the question: what happens when both models operate under the same budget (e.g., compute, run-time)? To address this question, we analyze code-generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model and selecting one. Our findings reveal that, in a standard unit-test setup, repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of a single output from the larger one. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
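As a minimal sketch of the budget-matched comparison described above, the following Python code contrasts one sample from a large model with the best of five samples from a smaller one under the unit-test setup; `generate`, `gen_70b`, `gen_13b`, and `passes_tests` are hypothetical stand-ins for a model's sampling call and a test harness, not part of the paper's actual pipeline.

```python
from typing import Callable, Optional

def best_of_n(
    generate: Callable[[str], str],
    passes_tests: Callable[[str], bool],
    prompt: str,
    n: int,
) -> Optional[str]:
    """Draw up to n samples from a (smaller) model and return the first
    candidate that passes the task's unit tests; None if all n fail."""
    for _ in range(n):
        candidate = generate(prompt)   # one sampled program
        if passes_tests(candidate):    # filter with the task's unit tests
            return candidate
    return None

def compare(prompt: str, gen_70b, gen_13b, passes_tests) -> dict:
    """Budget-matched comparison: one 70B sample vs. five 13B samples.
    The model callables and test harness are assumed placeholders."""
    large_output = gen_70b(prompt)
    small_output = best_of_n(gen_13b, passes_tests, prompt, n=5)
    return {
        "large_passes": passes_tests(large_output),
        "small_passes": small_output is not None,
    }
```

When unit tests are unavailable, the selection step would instead rely on a ranking model over the n candidates, which is the regime where the paper finds the smaller model falls short.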