The energy demand of modern cloud services, particularly those related to generative AI, is increasing at an unprecedented pace. To date, carbon-aware computing strategies have primarily focused on batch job scheduling or geo-distributed load balancing. However, such approaches are not applicable to services that must remain constantly available at specific locations due to latency, privacy, data, or infrastructure constraints. In this paper, we explore how the carbon footprint of energy-intensive services can be reduced by adjusting the fraction of requests served by different service quality tiers. We show that adapting response quality to grid carbon intensity can yield additional carbon savings beyond those from resource and energy efficiency. Building on this, we introduce a forecast-based multi-horizon optimization that achieves close-to-optimal carbon savings and automatically adapts service quality for best-effort users to stay within an annual carbon budget. Our approach can reduce the emissions of large-scale LLM services, which we estimate at tens of thousands of tons of CO2 annually, by up to 10%.