Large language models (LLMs) are increasingly prevalent on cloud platforms, driven by the introduction of AI-based consumer and enterprise services. Inference requests in particular account for up to 90% of total LLM lifecycle energy use, dwarfing the energy cost of training. The rising volume of inference requests is enlarging the environmental footprint of LLM serving, particularly its carbon emissions and water consumption. To improve the sustainability of LLM inference serving in cloud datacenters, we propose MARLIN, a novel multi-agent game-theoretic reinforcement learning framework that co-optimizes time-to-first-token (TTFT), carbon emissions, water usage, and energy costs associated with LLM inference. Compared to state-of-the-art LLM inference management frameworks, MARLIN achieves reductions of at least 18% in TTFT, 33% in carbon emissions, 43% in water usage, and 11% in energy costs.