The rapid development of large language models (LLMs) has driven the widespread adoption of cloud-based LLM inference services, while also raising prominent privacy risks in the transmission and processing of private data during remote inference. For privacy-preserving LLM inference to be practical in industrial scenarios, three core requirements must be satisfied simultaneously: (1) accuracy and efficiency losses should be minimized to avoid degrading the service experience; (2) the inference process must run on large-scale clusters consisting of heterogeneous legacy xPUs; (3) compatibility with existing LLM infrastructures should be ensured so that their engineering optimizations can be reused. To the best of our knowledge, no existing privacy-preserving LLM inference method satisfies all of these constraints while delivering meaningful privacy guarantees. In this paper, we propose AloePri, the first privacy-preserving LLM inference method for industrial applications. AloePri protects both input and output data through covariant obfuscation, which jointly transforms data and model parameters to achieve better accuracy and privacy. We carefully design the transformation for each model component to ensure inference accuracy and data privacy while maintaining full compatibility with existing Language-Model-as-a-Service infrastructures. AloePri has been integrated into an industrial system for the evaluation of mainstream LLMs. Evaluation on the Deepseek-V3.1-Terminus model (671B parameters) shows that AloePri incurs an accuracy loss of 0.0%~3.5% while matching the efficiency of plaintext inference. Meanwhile, AloePri successfully resists state-of-the-art attacks, with fewer than 5% of tokens recovered. To the best of our knowledge, AloePri is the first method with practical applicability to large-scale models in real-world systems.
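The abstract does not specify AloePri's concrete transformations, but the core idea of covariant obfuscation, jointly transforming the data and the model parameters so that computation on obfuscated data yields the same result, can be sketched for a single linear layer. Below is a toy illustration, assuming a random invertible matrix `P` as the secret transform (the choice of `P` and its application to each model component are assumptions, not AloePri's actual design):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical secret transform: a random invertible matrix P.
P = rng.standard_normal((d, d))
P_inv = np.linalg.inv(P)

W = rng.standard_normal((d, d))   # a linear layer's plaintext weights
x = rng.standard_normal(d)        # a private input embedding

x_obf = x @ P                     # client sends the obfuscated input
W_obf = P_inv @ W                 # server holds covariantly transformed weights

y_plain = x @ W                   # what plaintext inference would compute
y_obf = x_obf @ W_obf             # what the server computes on obfuscated data

# x P P^{-1} W = x W, so the obfuscated computation is exact.
assert np.allclose(y_plain, y_obf)
```

Because the transform cancels inside the matrix product, accuracy is preserved exactly for this layer; handling nonlinear components (attention, normalization, activations) is where per-component design, as the abstract describes, becomes necessary.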