The rising demand for Large Language Model (LLM) inference services has intensified pressure on computational resources, creating latency and cost challenges. This paper introduces a novel routing algorithm based on the Non-dominated Sorting Genetic Algorithm II (NSGA-II) to distribute inference requests across heterogeneous LLM instances in a cloud-edge computing environment. Formulated as a multi-objective optimization problem, the algorithm balances response quality, response time, and inference cost, adapting to request heterogeneity (e.g., varying complexity and prompt lengths) and node diversity (e.g., edge vs. cloud resources). This adaptive routing optimizes performance under dynamic workloads. We benchmark the approach on a testbed using the Stanford Question Answering Dataset (SQuAD), Mostly Basic Python Problems (MBPP), HellaSwag, and Grade School Math 8K (GSM8K) datasets. Experimental results show that, compared to the baselines, our solution preserves 95.2% of the Cloud-Only baseline's response quality with only a slight increase in latency, while reducing inference cost by 34.9%. These findings validate the algorithm's effectiveness for scalable LLM deployments.
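At the core of the NSGA-II formulation is Pareto dominance over the three stated objectives. The sketch below (not the paper's implementation; all numbers and route options are illustrative) shows how non-dominated sorting would identify the first front of candidate routing decisions, maximizing quality while minimizing latency and cost:

```python
# Minimal sketch of NSGA-II-style non-dominated filtering over routing
# candidates. Each candidate is (quality, latency_s, cost): quality is
# maximized; latency and cost are minimized. Values are illustrative.

def dominates(a, b):
    """True if candidate a Pareto-dominates b: no worse on every
    objective and strictly better on at least one."""
    q_a, l_a, c_a = a
    q_b, l_b, c_b = b
    no_worse = q_a >= q_b and l_a <= l_b and c_a <= c_b
    strictly_better = q_a > q_b or l_a < l_b or c_a < c_b
    return no_worse and strictly_better

def pareto_front(candidates):
    """Return the non-dominated set (NSGA-II's first front)."""
    return [c for c in candidates
            if not any(dominates(other, c)
                       for other in candidates if other != c)]

if __name__ == "__main__":
    # Hypothetical routing options for one request.
    routes = [
        (0.95, 1.8, 1.00),  # cloud: high quality, slow, expensive
        (0.80, 0.4, 0.20),  # edge: lower quality, fast, cheap
        (0.90, 0.9, 0.55),  # hybrid: intermediate trade-off
        (0.78, 1.0, 0.60),  # dominated by both edge and hybrid
    ]
    print(pareto_front(routes))  # the fourth route is filtered out
```

The full NSGA-II additionally ranks later fronts and applies crowding-distance selection to keep the trade-off surface diverse; this fragment only captures the dominance test that drives the ranking.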