Speculative decoding in large language models (LLMs) accelerates token generation by cheaply predicting multiple tokens speculatively and verifying them in parallel, and has been widely deployed. In this paper, we provide the first study demonstrating the privacy risks of speculative decoding. We observe that input-dependent patterns of correct and incorrect predictions can leak to an adversary who monitors token generation times and packet sizes, leading to privacy breaches. By observing the pattern of correctly and incorrectly speculated tokens, we show that a malicious adversary can fingerprint queries and learn private user inputs with more than $90\%$ accuracy across three different speculative decoding techniques: REST (almost $100\%$ accuracy), LADE (up to $92\%$ accuracy), and BiLD (up to $95\%$ accuracy). We further show that an adversary can exfiltrate confidential intellectual property used to design these techniques, such as data from the data-stores used for prediction (in REST) at a rate of more than $25$ tokens per second, or even the hyperparameters used for prediction (in LADE). We also discuss mitigation strategies, such as aggregating tokens across multiple iterations and padding packets with additional bytes, to avoid such privacy or confidentiality breaches.
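To make the fingerprinting attack concrete, the sketch below (a simplified illustration, not the paper's implementation) shows how an adversary might match an observed per-iteration pattern of accepted speculated tokens, inferred from packet sizes or inter-token timings, against acceptance patterns pre-recorded for known queries. All query names and fingerprint values here are hypothetical.

```python
# Hypothetical sketch: query fingerprinting from speculation-acceptance
# patterns. Each fingerprint is the number of speculated tokens accepted
# per decoding iteration, which an adversary could infer from observed
# response-packet sizes or token generation timings.

from difflib import SequenceMatcher

# Assumed pre-recorded acceptance patterns for known queries (illustrative).
known_fingerprints = {
    "query_A": [4, 0, 3, 3, 1, 4, 2],
    "query_B": [1, 1, 0, 2, 1, 0, 1],
    "query_C": [4, 4, 4, 3, 4, 4, 4],
}

def fingerprint_match(observed, database):
    """Return the known query whose acceptance pattern is most similar
    to the observed one (nearest-neighbour matching by sequence ratio)."""
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()
    return max(database, key=lambda q: similarity(observed, database[q]))

# A noisy observation of query_A's pattern (one iteration differs),
# e.g. due to timing-measurement jitter.
observed = [4, 0, 3, 2, 1, 4, 2]
print(fingerprint_match(observed, known_fingerprints))  # query_A
```

In practice, the paper's attacks operate on noisier side-channel observations and larger query sets, but the principle is the same: the acceptance pattern is input-dependent and therefore acts as a query fingerprint.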