The rapid adoption of large language models (LLMs) has raised concerns about their substantial energy consumption, especially when deployed at industry scale. While several techniques have been proposed to address this, limited empirical evidence exists on their effectiveness in LLM-based industry applications. To fill this gap, we analyzed a chatbot application in an industrial context at Schuberg Philis, a Dutch IT services company. We then selected four techniques, namely Small and Large Model Collaboration, Prompt Optimization, Quantization, and Batching, applied them to the application in eight variations, and conducted experiments to study their impact on energy consumption, accuracy, and response time compared to the unoptimized baseline. Our results show that several techniques, such as Prompt Optimization and 2-bit Quantization, reduced energy use significantly, sometimes by up to 90%. However, these techniques also degraded accuracy, to a degree that is unacceptable in practice. The only technique that achieved substantial energy reductions without considerably harming the other qualities was Small and Large Model Collaboration via Nvidia's Prompt Task and Complexity Classifier (NPCC) with prompt complexity thresholds. This highlights that reducing the energy consumption of LLM-based applications is not difficult in practice. However, improving their energy efficiency, i.e., reducing energy use without harming other qualities, remains challenging. Our study provides practical insights for moving towards this goal.