GreenServ: Energy-Efficient Context-Aware Dynamic Routing for Multi-Model LLM Inference

Large language models (LLMs) demonstrate remarkable capabilities, but their broad deployment is limited by significant computational resource demands, particularly energy consumption during inference. Static, one-model-fits-all inference strategies are often inefficient because they neither exploit the diverse range of available models nor adapt to varying query requirements. This paper presents GreenServ, a dynamic, context-aware routing framework that optimizes the trade-off between inference accuracy and energy efficiency. GreenServ extracts lightweight contextual features from each query, including task type, semantic cluster, and text complexity, and routes each query to the most suitable model in a heterogeneous pool, based on observed accuracy and energy usage. We employ a multi-armed bandit approach to learn adaptive routing policies online. This approach operates under partial feedback, eliminates the need for extensive offline calibration, and streamlines the integration of new models into the inference pipeline. We evaluate GreenServ across five benchmark tasks and a pool of 16 contemporary open-access LLMs. Experimental results show that GreenServ consistently outperforms static (single-model) and random baselines. In particular, compared to random routing, GreenServ achieves a 22% increase in accuracy while reducing cumulative energy consumption by 31%. Finally, on RouterBench, GreenServ achieves an average accuracy of 71.7% and a peak accuracy of 75.7%. All artifacts are open-source and available in an anonymous repository for review: https://anonymous.4open.science/r/llm-inference-router-EBEA/README.md
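
To make the routing idea concrete, the following is a minimal sketch of a contextual bandit router under partial feedback. It is not the GreenServ implementation: the abstract does not name the bandit algorithm or the reward definition, so this sketch assumes an epsilon-greedy policy and a reward of the form accuracy minus a weighted, normalized energy cost. The class `BanditRouter`, its methods, and the parameters `energy_weight` and `epsilon` are hypothetical names introduced here for illustration only.

```python
import random
from collections import defaultdict


class BanditRouter:
    """Illustrative epsilon-greedy contextual bandit over a pool of models.

    Assumption: a context is any hashable tuple of lightweight features,
    e.g. (task_type, semantic_cluster, complexity_bucket).
    """

    def __init__(self, models, energy_weight=0.5, epsilon=0.1, seed=0):
        self.models = list(models)
        self.energy_weight = energy_weight  # hypothetical accuracy/energy trade-off knob
        self.epsilon = epsilon              # exploration probability
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)      # (context, model) -> number of observations
        self.values = defaultdict(float)    # (context, model) -> running mean reward

    def add_model(self, name):
        # A new model starts with no observations and is tried on its next turn,
        # so no offline calibration step is needed in this sketch.
        self.models.append(name)

    def select(self, context):
        # Try every model at least once per context before exploiting.
        untried = [m for m in self.models if self.counts[(context, m)] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.values[(context, m)])

    def update(self, context, model, accuracy, energy_norm):
        # Partial feedback: only the model that served the query is updated.
        # Assumed reward: accuracy in [0, 1] minus weighted energy in [0, 1].
        reward = accuracy - self.energy_weight * energy_norm
        key = (context, model)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

A serving loop would call `select(context)` for each incoming query, run the chosen model, measure its accuracy and energy, and pass both to `update`. Because estimates are kept per context, a cheap model can win on simple queries while a larger one wins where its accuracy gain outweighs its energy cost.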
