The use of generative AI-based coding assistants such as ChatGPT and GitHub Copilot is a reality in contemporary software development. Many of these tools are provided as remote APIs, and relying on third-party APIs raises data privacy and security concerns for client companies, which motivates the use of locally deployed language models. In this study, we explore the trade-off between model accuracy and energy consumption, aiming to provide insights that help developers make informed decisions when selecting a language model. We investigate the performance of 18 families of LLMs on typical software development tasks using two real-world infrastructures: a commodity GPU and a powerful AI-specific GPU. Since deploying LLMs locally requires powerful hardware that may not be affordable for everyone, we consider both full-precision and quantized models. Our findings reveal that employing a large LLM with a higher energy budget does not always translate into significantly improved accuracy. Moreover, quantized versions of large models generally offer better efficiency and accuracy than full-precision versions of medium-sized ones. Finally, no single model is suitable for all types of software development tasks.