Generative AI (GenAI) has transformed applications in natural language processing and content creation, yet centralized inference remains hindered by high latency, limited customizability, and privacy concerns. Deploying large models (LMs) in mobile edge networks has emerged as a promising solution. However, it also poses new challenges, including heterogeneous multi-modal LMs with diverse resource demands and inference speeds, varied prompt/output modalities that complicate orchestration, and resource-limited infrastructure ill-suited for concurrent LM execution. In response, we propose a Multi-Agentic AI framework for latency- and fairness-aware multi-modal LM inference in mobile edge networks. Our solution comprises a long-term planning agent, a short-term prompt scheduling agent, and multiple on-node LM deployment agents, all powered by foundation language models. These agents cooperatively optimize prompt routing and LM deployment through natural language reasoning over runtime telemetry and historical experience. To evaluate its performance, we further develop a city-wide testbed that supports network monitoring, containerized LM deployment, intra-server resource management, and inter-server communication. Experiments demonstrate that, compared with baseline methods, our solution reduces average latency by over 80% and improves fairness (normalized Jain index) to 0.90. Moreover, our solution adapts quickly without fine-tuning, offering a generalizable approach to optimizing GenAI services in edge environments.