AI-enabled systems are subject to various runtime uncertainties, including dynamic workloads, shifting resource requirements, and model drift. These uncertainties significantly impact the overall Quality of Service (QoS). This is particularly true for Language Model (LM) enabled systems, where the autoregressive nature of token generation introduces variability in latency, energy usage, and response quality. Systems powered by Large Language Models (LLMs) are either resource-intensive (if run on-premises) or raise privacy and cost concerns (if accessed through APIs). While deploying a Small Language Model (SLM) can be resource-efficient, a single SLM often falls short of the diversity and scale of real-world requirements. We therefore argue that, rather than relying on any one SLM, leveraging a coordinated fleet of SLMs, each with specialized strengths, can enable systems to adapt dynamically to shifting contexts and workload patterns. Realizing the full potential of such an approach, however, demands intelligent orchestration and continuous adaptation. To this end, we introduce CALM, a self-adaptive orchestration mechanism based on the MAPE-K loop. Our approach continuously monitors user queries, analyzes the QoS metrics of the SLMs, identifies the optimal SLM, and routes each query to it; to further improve effectiveness and efficiency, it leverages caching and scheduling to decide which SLMs to keep in memory. Our evaluation shows that CALM reduces latency by approximately 40% and energy consumption by approximately 50%, while preserving domain-specific task performance compared to single-LLM baselines.
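The monitor-analyze-plan style of routing described above can be sketched as follows. This is a minimal illustration, not CALM's actual policy: the `SLMProfile` fields, the keyword-based domain classifier, and the cost weights are all hypothetical assumptions.

```python
from dataclasses import dataclass

@dataclass
class SLMProfile:
    """Hypothetical per-SLM QoS record (illustrative fields only)."""
    name: str
    domain: str           # specialized strength of this SLM
    avg_latency_ms: float
    avg_energy_j: float
    quality_score: float  # domain task performance, 0..1

def classify_domain(query: str) -> str:
    """Analyze: crude keyword-based domain detection (a stand-in
    for a real query classifier)."""
    if "def " in query or "code" in query:
        return "code"
    return "general"

def select_slm(query: str, fleet: list[SLMProfile],
               w_lat: float = 0.4, w_energy: float = 0.3,
               w_quality: float = 0.3) -> SLMProfile:
    """Plan: pick the SLM minimizing a weighted QoS cost among
    domain-matching candidates, falling back to the whole fleet."""
    candidates = [m for m in fleet if m.domain == classify_domain(query)]
    if not candidates:
        candidates = fleet
    def cost(m: SLMProfile) -> float:
        return (w_lat * m.avg_latency_ms / 1000.0
                + w_energy * m.avg_energy_j
                - w_quality * m.quality_score)
    return min(candidates, key=cost)

fleet = [
    SLMProfile("slm-code", "code", 350.0, 2.0, 0.85),
    SLMProfile("slm-chat", "general", 250.0, 1.5, 0.80),
]
chosen = select_slm("Please explain this code snippet", fleet)
```

In a full MAPE-K loop, the execute step would dispatch the query to `chosen` and feed observed latency, energy, and quality back into the profiles, closing the adaptation cycle.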