Large language models (LLMs) have shown outstanding performance on a wide range of tasks, including question answering. However, LLMs incur high computation and memory costs. They may also leak private information when the training or prediction procedure involves sensitive data. In this paper, we propose SPA (Side Plugin Adaption), a lightweight architecture for fast on-device inference and privacy preservation under strict on-device computation and memory constraints. Compared with other on-device seq2seq generation methods, SPA performs fast and stable inference under low-resource constraints, which makes it cost-efficient. Our method establishes an interaction between a pretrained LLM in the cloud and additive parameters on the device, drawing on both the knowledge of the pretrained LLM and private personal features. Furthermore, SPA provides a framework that keeps feature-based parameters on privacy-guaranteed but low-compute devices, while leaving the parameters that contain general information on high-compute devices.
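The split between cloud-hosted general parameters and on-device additive parameters can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names `CloudBackbone` and `DeviceSidePlugin`, the function `spa_style_forward`, the low-rank form of the additive parameters, and the toy weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


@dataclass(frozen=True)
class CloudBackbone:
    """Hypothetical stand-in for the frozen pretrained LLM hosted in the cloud."""
    weights: Matrix  # general-purpose parameters, never updated on the device

    def hidden(self, x: Vector) -> Vector:
        return matvec(self.weights, x)


@dataclass
class DeviceSidePlugin:
    """Hypothetical on-device additive parameters (the low-rank form is an assumption)."""
    down: Matrix  # projects the hidden state down to a small rank
    up: Matrix    # projects back up to the hidden size

    def adapt(self, h: Vector) -> Vector:
        delta = matvec(self.up, matvec(self.down, h))
        # Additive combination: general representation plus a private correction.
        return [hi + di for hi, di in zip(h, delta)]


def spa_style_forward(cloud: CloudBackbone, plugin: DeviceSidePlugin, x: Vector) -> Vector:
    h = cloud.hidden(x)     # the heavy computation runs in the cloud
    return plugin.adapt(h)  # the private parameters are applied locally


# Toy example: identity backbone of size 2 and a rank-1 plugin.
cloud = CloudBackbone(weights=[[1.0, 0.0], [0.0, 1.0]])
plugin = DeviceSidePlugin(down=[[1.0, 1.0]], up=[[0.5], [0.0]])
output = spa_style_forward(cloud, plugin, [2.0, 3.0])
```

In this sketch the device stores only the small `down` and `up` matrices, so its memory footprint grows with the chosen rank rather than with the size of the backbone, and the personal parameters never leave the device.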