As Large Language Models (LLMs) gain traction across critical domains, ensuring secure and trustworthy training processes has become a major concern. Backdoor attacks, in which malicious actors inject hidden triggers into training data, are particularly insidious and difficult to detect. Existing post-training verification solutions such as Proof-of-Learning are impractical for LLMs: they require full retraining, lack robustness against stealthy manipulations, and cannot provide early detection during training, even though early detection would significantly reduce computational costs. To address these limitations, we introduce Proof-of-Training Steps, a verification protocol that enables an independent auditor (Alice) to confirm that an LLM developer (Bob) has followed the declared training recipe, including data batches, architecture, and hyperparameters. By analyzing the sensitivity of the LLM's language-modeling head (LM-Head) to input perturbations, our method can expose subtle backdoor injections or deviations from the declared training. Even with backdoor triggers in up to 10% of the training data, our protocol significantly reduces the attacker's ability to achieve a high attack success rate (ASR). It also enables early detection of attacks at the injection step, with verification steps 3x faster than training steps. Our results highlight the protocol's potential to enhance the accountability and security of LLM development, especially against insider threats.
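To make the core idea concrete, the sketch below illustrates (in PyTorch) what probing an LM-Head's sensitivity to small input perturbations could look like. It is a minimal toy example, not the paper's implementation: the tiny GPT-style model, the `sensitivity_score` statistic, the Gaussian perturbation at the embedding layer, and the scale `eps` are all illustrative assumptions rather than the protocol's actual components.

```python
# Minimal sketch (assumed, not the paper's method): probe how much the LM-Head's
# logits move under small input perturbations. An auditor could record such a
# statistic at declared training steps and compare it against a locally
# reproduced value; a deviation may indicate an undeclared change such as a
# backdoor injection.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, D_MODEL = 1000, 64


class TinyLM(nn.Module):
    """A toy language model: embedding -> transformer block -> LM-Head."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.block = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, token_ids, noise=None):
        h = self.embed(token_ids)
        if noise is not None:      # inject the perturbation at the embedding layer
            h = h + noise
        h = self.block(h)
        return self.lm_head(h)     # logits over the vocabulary


def sensitivity_score(model, token_ids, eps=1e-3, n_probes=8):
    """Average relative change in LM-Head logits under Gaussian input noise."""
    model.eval()
    with torch.no_grad():
        base = model(token_ids)
        deltas = []
        for _ in range(n_probes):
            noise = eps * torch.randn(*token_ids.shape, D_MODEL)
            deltas.append((model(token_ids, noise) - base).norm() / base.norm())
    return torch.stack(deltas).mean().item()


if __name__ == "__main__":
    model = TinyLM()
    batch = torch.randint(0, VOCAB, (2, 16))   # stand-in for a declared data batch
    print(f"LM-Head sensitivity: {sensitivity_score(model, batch):.6f}")
```

In this hypothetical setup, the auditor would recompute the statistic on the declared batches and flag a training step whose recorded sensitivity diverges beyond an agreed tolerance; the actual protocol, thresholds, and perturbation scheme are those defined in the paper, not this sketch.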