Foundation models (FMs) unlock unprecedented multimodal and multitask intelligence, yet their cloud-centric deployment precludes real-time responsiveness and compromises user privacy. Meanwhile, monolithic execution at the edge remains infeasible under stringent resource limits and uncertain network dynamics. To bridge this gap, we propose a microservice-based FM inference framework that exploits the intrinsic functional asymmetry between heavyweight core services and agile light services. Our two-tier deployment strategy ensures robust Quality of Service (QoS) under resource contention. Specifically, core services are placed statically via a long-term, network-aware integer program with sparsity constraints, forming a fault-tolerant backbone. In contrast, light services are orchestrated dynamically by a low-complexity online controller that integrates effective capacity theory with Lyapunov optimization, providing probabilistic latency guarantees under real-time workload fluctuations. Simulations demonstrate that our framework achieves an average on-time task completion rate above 84% at moderate deployment cost and remains robust as the system load scales.
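The abstract does not specify the online controller in detail. As a rough illustration only, the sketch below shows one generic way effective capacity and a Lyapunov drift-plus-penalty rule can be combined to size the replicas of a single light service in each time slot; it is not the authors' controller. All of the following are our own assumptions: an i.i.d. discrete per-replica service rate, the standard approximation Pr(D > D_max) ≈ exp(−θ·λ·D_max), a quadratic deployment cost, and the hypothetical names `effective_capacity`, `required_theta`, `choose_replicas`, and `run`.

```python
import math
import random
from typing import List, Sequence, Tuple


def effective_capacity(rates: Sequence[float], probs: Sequence[float], theta: float) -> float:
    """EC(theta) = -(1/theta) * ln E[exp(-theta * R)] for an i.i.d. per-slot
    service rate R with the discrete distribution (rates, probs)."""
    mgf = sum(p * math.exp(-theta * r) for r, p in zip(rates, probs))
    return -math.log(mgf) / theta


def required_theta(arrival_rate: float, d_max: float, eps: float) -> float:
    """QoS exponent theta* such that exp(-theta* * arrival_rate * d_max) = eps.
    Providing EC(theta*) >= arrival_rate then keeps the approximate
    delay-violation probability Pr(D > d_max) at or below eps."""
    return -math.log(eps) / (arrival_rate * d_max)


def choose_replicas(z: float, v: float, cost_weight: float, ec_one: float, n_max: int) -> int:
    """Drift-plus-penalty step: minimize V * cost(n) - Z * EC_n over n in {0..n_max}.
    EC_n = n * EC_1 because the log-MGF of a sum of i.i.d. replica rates is additive.
    cost(n) = cost_weight * n**2 is a hypothetical convex deployment cost."""
    return min(range(n_max + 1),
               key=lambda n: v * cost_weight * n ** 2 - z * n * ec_one)


def run(num_slots: int = 200, v: float = 5.0,
        seed: int = 0) -> Tuple[List[Tuple[float, int, float]], float]:
    rng = random.Random(seed)
    rates, probs = [0.0, 2.0, 4.0], [0.1, 0.3, 0.6]  # hypothetical per-replica rate distribution
    z = 0.0  # virtual queue for the long-term constraint "effective capacity >= arrival rate"
    history = []
    for _ in range(num_slots):
        lam = rng.uniform(1.0, 6.0)  # hypothetical fluctuating arrival rate for this slot
        theta = required_theta(lam, d_max=2.0, eps=1e-2)
        ec_one = effective_capacity(rates, probs, theta)
        n = choose_replicas(z, v, cost_weight=0.5, ec_one=ec_one, n_max=8)
        served = n * ec_one
        z = max(z + lam - served, 0.0)  # Z(t+1) = max(Z(t) + lambda(t) - EC_n(t), 0)
        history.append((lam, n, served))
    return history, z
```

Because Z(t+1) ≥ Z(t) + λ(t) − EC_{n(t)}, a bounded virtual queue implies that the time-averaged effective capacity keeps up with the time-averaged load. The weight V trades deployment cost against how large Z, and hence the transient constraint violation, is allowed to grow.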