Model Leeching is a novel extraction attack targeting Large Language Models (LLMs), capable of distilling task-specific knowledge from a target LLM into a reduced-parameter model. We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively, for only $50 in API cost. We further demonstrate the feasibility of adversarial attack transferability: a model extracted via Model Leeching is used to perform ML attack staging against the target LLM, resulting in an 11% increase in attack success rate when applied to ChatGPT-3.5-Turbo.
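The EM and F1 figures quoted above can be made concrete with a minimal sketch of how such scores are conventionally computed for SQuAD-style answers. It assumes the standard SQuAD normalization (lowercasing, stripping punctuation and English articles) and token-level F1; the evaluation script actually used for the reported numbers is not given here, and the function names `normalize`, `exact_match`, and `f1_score` are illustrative only. Scoring the extracted model's answers against gold answers would give accuracy, while scoring them against the target LLM's answers would presumably give the similarity figure.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """SQuAD-style normalization (assumed): lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging either function over all question-answer pairs in an evaluation set yields a dataset-level percentage comparable in form to the scores reported above.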