Speculative decoding accelerates token generation in large language models (LLMs) by cheaply predicting multiple tokens and verifying them in parallel, and it has been widely deployed. In this paper, we provide the first study demonstrating the privacy risks of speculative decoding. We observe that input-dependent patterns of correct and incorrect predictions can leak to an adversary who monitors token generation times and packet sizes, leading to privacy breaches. By observing the pattern of correctly and incorrectly speculated tokens, we show that a malicious adversary can fingerprint queries and learn private user inputs with more than $90\%$ accuracy across three different speculative decoding techniques: BiLD (almost $100\%$ accuracy), LADE (up to $92\%$ accuracy), and REST (up to $95\%$ accuracy). We show that an adversary can also extract confidential intellectual property used to design these techniques, such as data from the data-stores used for prediction (in REST) at a rate of more than $25$ tokens per second, or even the hyper-parameters used for prediction (in LADE). We also discuss mitigation strategies, such as aggregating tokens across multiple iterations and padding packets with additional bytes, to avoid such privacy or confidentiality breaches.
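To make the leakage channel concrete, the following is a minimal sketch, not the attack or defense evaluated in this paper. It assumes the adversary can recover, from packet sizes or timing, how many tokens each decoding iteration emitted, and that it holds reference traces for candidate queries. The nearest-trace matching rule, the function names, the window size, and the bucket size are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def trace_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between two per-iteration token-count traces.

    The shorter trace is zero-padded so both have the same length.
    """
    n = max(len(a), len(b))
    if n == 0:
        return 0.0
    pa = list(a) + [0] * (n - len(a))
    pb = list(b) + [0] * (n - len(b))
    return sum(abs(x - y) for x, y in zip(pa, pb)) / n


def fingerprint(observed: Sequence[int], profiles: Dict[str, List[int]]) -> str:
    """Return the label of the reference trace closest to the observed one.

    Assumption: nearest-trace matching stands in for whatever classifier a
    real adversary would use.
    """
    return min(profiles, key=lambda label: trace_distance(observed, profiles[label]))


def aggregate(trace: Sequence[int], window: int) -> List[int]:
    """Mitigation sketch: release tokens once per `window` iterations.

    Only the summed count per window is visible, which hides the
    per-iteration pattern of correct and incorrect speculation.
    """
    return [sum(trace[i:i + window]) for i in range(0, len(trace), window)]


def pad_to_bucket(size_bytes: int, bucket: int) -> int:
    """Mitigation sketch: round a packet size up to a multiple of `bucket`."""
    return -(-size_bytes // bucket) * bucket


# Hypothetical traces: tokens emitted per iteration for two known queries.
profiles = {"query_a": [1, 3, 1, 2, 4], "query_b": [2, 1, 1, 1, 3]}
observed = [1, 3, 1, 2, 3]

guess = fingerprint(observed, profiles)
coarse = aggregate(profiles["query_a"], window=2)
padded = pad_to_bucket(130, bucket=64)
```

In this toy setup the observed trace matches `query_a` far more closely than `query_b`, which is the sense in which a speculation pattern acts as a fingerprint. Aggregation and padding coarsen what the adversary sees, at the cost of added latency and bandwidth.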