We introduce the first model-stealing attack that extracts precise, nontrivial information from black-box production language models like OpenAI's ChatGPT or Google's PaLM-2. Specifically, our attack recovers the embedding projection layer (up to symmetries) of a transformer model, given typical API access. For under \$20 USD, our attack extracts the entire projection matrix of OpenAI's Ada and Babbage language models. We thereby confirm, for the first time, that these black-box models have a hidden dimension of 1024 and 2048, respectively. We also recover the exact hidden dimension size of the gpt-3.5-turbo model, and estimate it would cost under \$2,000 in queries to recover the entire projection matrix. We conclude with potential defenses and mitigations, and discuss the implications of possible future work that could extend our attack.
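The hidden-dimension recovery rests on a simple linear-algebra observation: a transformer's final logits are the product of a (vocab_size × hidden_dim) projection matrix with a hidden state, so every logit vector the API returns lies in a subspace of dimension at most hidden_dim. The numerical rank of a matrix of collected logit vectors therefore reveals the hidden dimension. The toy sketch below illustrates this on a simulated model; it is not the paper's attack code, and the matrix sizes and the full-logit access it assumes are illustrative only.

```python
import numpy as np

# Toy illustration (simulated model, not a real API): logits = W @ h, where
# W is the secret (vocab_size x hidden_dim) embedding projection matrix.
rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 500, 64, 128

W = rng.normal(size=(vocab_size, hidden_dim))  # secret projection layer
H = rng.normal(size=(hidden_dim, n_queries))   # hidden states, one per query
logits = W @ H                                  # stacked logit vectors observed via the API

# All columns of `logits` lie in the hidden_dim-dimensional column space of W,
# so counting the significant singular values recovers the hidden dimension.
s = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int(np.sum(s > 1e-8 * s[0]))
print(recovered_dim)  # matches hidden_dim = 64
```

In practice the attack must contend with APIs that return only top-k log-probabilities and with numerical noise, which is where the paper's contribution lies; the sketch only captures the underlying rank argument.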