As large language models (LLMs) become widely used, their environmental impact, especially their carbon emissions, has attracted growing attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emissions and SLO satisfaction, and reconfigures resources over time to balance SLO attainment against carbon emissions under dynamic workloads. Evaluations on real traces demonstrate that GreenCache achieves an average carbon reduction of 15.1% when serving Llama-3 70B on the FR (France) grid, with reductions reaching up to 25.3%, while staying within latency constraints for more than 90% of requests.
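The tradeoff described above, operational carbon saved by cache hits versus embodied carbon added by SSD capacity, can be illustrated with a minimal sketch. This is not GreenCache's actual algorithm: the names `CachePlan`, `total_carbon_g`, and `pick_plan`, the linear "hit rate removes that fraction of energy" model, and every number in the example are hypothetical assumptions chosen only to show how the preferred cache size can shift with grid carbon intensity. SLO constraints are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePlan:
    """Hypothetical allocation plan: cache capacity and the hit rate it yields."""
    ssd_tb: float     # provisioned SSD cache capacity in TB
    hit_rate: float   # assumed fraction of compute energy avoided by KV-cache reuse


def total_carbon_g(plan, *, hours, base_energy_kwh_per_hour, grid_g_per_kwh,
                   ssd_embodied_g_per_tb, ssd_lifetime_hours):
    """Operational plus amortized embodied carbon, in grams of CO2e."""
    # Operational carbon: energy left after cache hits, times grid intensity.
    # Assumption: energy scales linearly with (1 - hit_rate).
    energy_kwh = base_energy_kwh_per_hour * hours * (1.0 - plan.hit_rate)
    operational = energy_kwh * grid_g_per_kwh
    # Embodied carbon: SSD manufacturing carbon amortized over device lifetime.
    embodied = plan.ssd_tb * ssd_embodied_g_per_tb * hours / ssd_lifetime_hours
    return operational + embodied


def pick_plan(plans, **params):
    """Return the plan with the lowest total carbon under the given parameters."""
    return min(plans, key=lambda p: total_carbon_g(p, **params))


# Hypothetical candidate plans and hardware parameters (illustrative only).
PLANS = [CachePlan(0, 0.0), CachePlan(4, 0.3), CachePlan(16, 0.5)]
HARDWARE = dict(hours=1, base_energy_kwh_per_hour=5.0,
                ssd_embodied_g_per_tb=160_000, ssd_lifetime_hours=43_800)

if __name__ == "__main__":
    for grid in (5, 30, 400):  # hypothetical grid intensities in gCO2e/kWh
        best = pick_plan(PLANS, grid_g_per_kwh=grid, **HARDWARE)
        print(f"grid={grid} g/kWh -> {best.ssd_tb} TB cache")
```

Under these made-up inputs, a cleaner grid favors a smaller cache because the amortized embodied carbon of the SSDs outweighs the operational savings, while a carbon-intensive grid favors a larger cache. A real system would also need to check each candidate plan against latency SLOs before selecting it.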