With the rapid development of artificial general intelligence (AGI), various multimedia services based on pretrained foundation models (PFMs) need to be deployed effectively. With edge servers that have cloud-level computing power, edge intelligence can extend the capabilities of AGI to mobile edge networks. However, compared with cloud data centers, resource-limited edge servers can cache and execute only a small number of PFMs, which typically consist of billions of parameters and require intensive computing power and GPU memory during inference. To address this challenge, we propose a joint foundation model caching and inference framework that balances the tradeoff among inference latency, accuracy, and resource consumption by efficiently managing cached PFMs and user requests during the provisioning of generative AI services. Specifically, considering the in-context learning ability of PFMs, we propose a new metric, the Age of Context (AoC), to model the freshness and relevance of examples in past demonstrations with respect to current service requests. Based on the AoC, we propose a least context caching algorithm that manages cached PFMs at edge servers using historical prompts and inference results. Numerical results demonstrate that, by effectively utilizing contextual information, the proposed algorithm reduces system costs compared with existing baselines.
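To make the idea of context-aware eviction concrete, the following is a minimal sketch, not the algorithm evaluated in the paper. The abstract does not give the AoC formula, so the sketch assumes a simple stand-in: each served request adds one unit of context to its model, and all stored context decays by a constant factor per time step. The class name `LeastContextCache`, the `decay` parameter, and the memory sizes in GB are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class LeastContextCache:
    """Toy edge cache that evicts the model with the least accumulated context.

    Assumption (not from the paper): context is a decayed count of served requests.
    """
    capacity_gb: float
    decay: float = 0.9                           # hypothetical per-step freshness decay
    sizes: dict = field(default_factory=dict)    # model name -> GPU memory (GB)
    context: dict = field(default_factory=dict)  # model name -> decayed context score

    def tick(self) -> None:
        """Advance one time step: every stored context score becomes staler."""
        for name in self.context:
            self.context[name] *= self.decay

    def request(self, name: str, size_gb: float) -> bool:
        """Serve a request for a model; return True on a cache hit, False on a miss."""
        hit = name in self.sizes
        if not hit:
            if size_gb > self.capacity_gb:
                raise ValueError(f"{name} does not fit in the cache at all")
            # Evict the models with the least context until the new one fits.
            while sum(self.sizes.values()) + size_gb > self.capacity_gb:
                victim = min(self.context, key=self.context.get)
                del self.sizes[victim]
                del self.context[victim]
            self.sizes[name] = size_gb
            self.context[name] = 0.0
        # Each served request contributes one fresh example to the model's context.
        self.context[name] += 1.0
        return hit
```

In this sketch, a model that has served many recent requests keeps a high score and stays cached even if another model was requested more recently, which is where the policy differs from least-recently-used eviction. Any real implementation would replace the decayed counter with the paper's AoC definition and its cost model.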