GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference

Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient because they neither exploit the diverse range of available models nor adapt to varying query requirements. This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query (task type, semantic cluster, and text complexity) and routes the query to the most suitable model in a heterogeneous pool, based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline. We evaluate GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieves a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, we evaluate GreenServ on RouterBench, where it achieves an average accuracy of 71.7% and a peak accuracy of 75.7%. All artifacts are open-source and available on \href{https://github.com/TZData1/llm-inference-router}{GitHub}.
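
To make the routing idea concrete, the following is a minimal sketch of a contextual bandit router, not the GreenServ implementation. It assumes an epsilon-greedy policy over discrete contexts and a reward of the form accuracy minus a weighted, normalized energy cost; the abstract specifies neither choice. The class name `EpsilonGreedyRouter`, the parameters `epsilon` and `energy_weight`, and the context tuple layout are all hypothetical.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Illustrative epsilon-greedy router over discrete query contexts.

    Assumption: reward = accuracy - energy_weight * normalized_energy.
    This is a stand-in for whatever bandit policy and reward the paper uses.
    """

    def __init__(self, models, epsilon=0.1, energy_weight=0.5, seed=0):
        self.models = list(models)
        self.epsilon = epsilon
        self.energy_weight = energy_weight
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, model) -> number of pulls
        self.means = defaultdict(float)   # (context, model) -> mean reward

    def select(self, context):
        # Try every model once per context before exploiting estimates.
        untried = [m for m in self.models if self.counts[(context, m)] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.means[(context, m)])

    def update(self, context, model, accuracy, energy_norm):
        # Partial feedback: only the chosen model's outcome is observed.
        reward = accuracy - self.energy_weight * energy_norm
        key = (context, model)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]
        return reward


# Hypothetical context: (task type, semantic cluster id, complexity bin).
router = EpsilonGreedyRouter(["small-model", "large-model"], epsilon=0.1)
context = ("qa", 3, "low")
chosen = router.select(context)
router.update(context, chosen, accuracy=1.0, energy_norm=0.2)
```

Because estimates are kept per (context, model) pair and updated online, a new model can be added to `models` and will be tried on the next query of each context without offline calibration, which mirrors the property claimed above.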
