Large language models (LLMs) have shown strong performance on a wide range of tasks, including question answering. However, LLMs require substantial memory, which low-resource devices often cannot provide. More critically, computational speed on these devices is also severely limited. In this paper, we propose SPA (Side Plugin Adaption), a lightweight architecture for fast on-device inference under strict on-device computation and memory constraints. Compared with other on-device seq2seq generation methods, SPA performs fast and stable inference under low-resource constraints, making it cost-efficient. Our method establishes an interaction between a pretrained LLM in the cloud and additive parameters on the device, which provides both the general knowledge of the pretrained LLM and personalized features. Furthermore, SPA provides a framework that keeps feature-based parameters on low-compute devices while leaving the parameters that contain general information on high-compute devices.
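The cloud/device split described above can be illustrated with a minimal sketch. It is not SPA's actual implementation: the abstract does not specify how the side plugin is built or how its output is combined with the pretrained LLM, so the low-rank bottleneck, the elementwise sum, and the names `CloudBase`, `DeviceSidePlugin`, and `combined_forward` are all assumptions made for illustration. A single frozen linear layer stands in for the pretrained LLM.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Build a rows x cols matrix of small random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class CloudBase:
    """Hypothetical stand-in for the frozen pretrained LLM hosted in the cloud."""

    def __init__(self, dim, rng):
        self.weight = make_matrix(dim, dim, rng)  # general-knowledge parameters

    def forward(self, x):
        return matvec(self.weight, x)

    def num_params(self):
        return len(self.weight) * len(self.weight[0])


class DeviceSidePlugin:
    """Hypothetical small additive module kept on the device.

    Assumption: a low-rank bottleneck (dim -> rank -> dim) keeps the
    on-device parameter count small. The paper may use a different design.
    """

    def __init__(self, dim, rank, rng):
        self.down = make_matrix(rank, dim, rng)  # personalized parameters
        self.up = make_matrix(dim, rank, rng)

    def forward(self, x):
        return matvec(self.up, matvec(self.down, x))

    def num_params(self):
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


def combined_forward(cloud, plugin, x):
    """Add the device-side correction to the cloud-side output (assumed rule)."""
    base = cloud.forward(x)    # would be computed remotely
    delta = plugin.forward(x)  # computed locally on the device
    return [b + d for b, d in zip(base, delta)]


if __name__ == "__main__":
    rng = random.Random(0)
    cloud = CloudBase(dim=8, rng=rng)
    plugin = DeviceSidePlugin(dim=8, rank=2, rng=rng)
    output = combined_forward(cloud, plugin, [1.0] * 8)
    print(len(output), cloud.num_params(), plugin.num_params())
```

In this toy setting the device holds 32 parameters against 64 in the cloud. With a realistic hidden size the gap would be far larger, which is the property the abstract relies on when it places feature-based parameters on low-compute devices.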