Context: AI-assisted tools are increasingly integrated into software development workflows, but their reliance on large language models (LLMs) introduces substantial computational and energy costs. Understanding and reducing the energy footprint of LLM inference is therefore essential for sustainable software development. Objective: In this study, we conduct a phase-level analysis of LLM inference energy consumption, distinguishing between (1) prefill, where the model processes the input and builds internal representations, and (2) decoding, where output tokens are generated from the stored state. Method: We investigate six 6B-7B and four 3B-4B transformer-based models, evaluating them on the code-centric benchmarks HumanEval (code generation) and LongBench (code understanding). Results: Our findings show that, within both parameter groups, models exhibit distinct energy patterns across the two phases. Furthermore, we observe that increases in prefill cost amplify the per-token energy cost during decoding, with amplifications ranging from 1.3% to 51.8% depending on the model. Lastly, three of the ten models exhibit babbling behavior, appending excessive content to the output that unnecessarily inflates energy consumption. We implemented babbling suppression for code generation, achieving energy savings ranging from 44% to 89% without affecting generation accuracy. Conclusion: These findings show that prefill costs influence decoding, which dominates energy consumption, and that babbling suppression can yield up to 89% energy savings. Reducing inference energy therefore requires both mitigating babbling behavior and limiting the impact of prefill on decoding.