Recent advances in Small Language Models (SLMs) have opened new possibilities for efficient code generation. SLMs offer lightweight, cost-effective alternatives to Large Language Models (LLMs), making them attractive for resource-constrained environments. However, empirical understanding of SLMs, particularly their capabilities, limitations, and performance trade-offs in code generation, remains limited. This study presents a comprehensive empirical evaluation of 20 open-source SLMs, ranging from 0.4B to 10B parameters, on five diverse code-related benchmarks (HumanEval, MBPP, Mercury, HumanEvalPack, and CodeXGLUE). The models are assessed along three dimensions: i) functional correctness of generated code, ii) computational efficiency, and iii) performance across multiple programming languages. The findings reveal that several compact SLMs achieve competitive results while maintaining a balance between performance and efficiency, making them viable for deployment in resource-constrained environments. However, further improvements in accuracy require switching to larger models, which generally outperform their smaller counterparts but demand substantially more computational resources. We observe that a 10% performance improvement can require nearly a 4x increase in VRAM consumption, highlighting a trade-off between effectiveness and scalability. In addition, the multilingual performance analysis reveals that SLMs tend to perform better in languages such as Python, Java, and PHP, while exhibiting relatively weaker performance in Go, C++, and Ruby. However, statistical analysis suggests these differences are not significant, indicating that SLMs generalize across programming languages. Based on these findings, this work provides insights into the design and selection of SLMs for real-world code generation tasks.