Stop Testing Attacks, Start Diagnosing Defenses: The Four-Checkpoint Framework Reveals Where LLM Safety Breaks

Large Language Models (LLMs) deploy safety mechanisms to prevent harmful outputs, yet these defenses remain vulnerable to adversarial prompts. Existing research demonstrates that jailbreak attacks succeed, but it does not explain \textit{where} defenses fail or \textit{why}. To address this gap, we propose that LLM safety operates as a sequential pipeline with distinct checkpoints. We introduce the \textbf{Four-Checkpoint Framework}, which organizes safety mechanisms along two dimensions: processing stage (input vs.\ output) and detection level (literal vs.\ intent). This yields four checkpoints, CP1 through CP4, each representing a defensive layer that can be evaluated independently. We design 13 evasion techniques, each targeting a specific checkpoint, enabling controlled testing of individual defensive layers. Using this framework, we evaluate GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across 3,312 single-turn, black-box test cases. We employ an LLM-as-judge approach for response classification and introduce Weighted Attack Success Rate (WASR), a severity-adjusted metric that captures partial information leakage overlooked by binary evaluation. Our evaluation reveals clear patterns. The traditional binary Attack Success Rate (ASR) reports 22.6\% attack success, whereas WASR reports 52.7\%, a 2.3$\times$ higher estimate of vulnerability. Output-stage defenses (CP3, CP4) prove weakest at 72--79\% WASR, while input-literal defenses (CP1) are strongest at 13\% WASR. Claude Sonnet 4 is the most robust model (42.8\% WASR), followed by GPT-5 (55.9\%) and Gemini 2.5 Pro (59.5\%). These findings suggest that current defenses are strongest at the input-literal checkpoint but remain vulnerable to intent-level manipulation and output-stage techniques. The Four-Checkpoint Framework provides a structured approach for identifying and addressing safety vulnerabilities in deployed systems.
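To illustrate why a severity-adjusted metric can report a much higher rate than a binary one, the following is a minimal sketch and not the paper's implementation. The label names (`refusal`, `partial`, `full`), the severity weights, and the sample judge labels are all hypothetical assumptions; the abstract does not specify how WASR weights partial leakage.

```python
# Hypothetical severity weights: the abstract does not give the real values.
# A "partial" response leaks some harmful information without fully complying.
SEVERITY_WEIGHTS = {"refusal": 0.0, "partial": 0.5, "full": 1.0}


def binary_asr(labels):
    """Fraction of responses judged as full compliance (partial leaks count as 0)."""
    return sum(1 for label in labels if label == "full") / len(labels)


def weighted_asr(labels, weights=SEVERITY_WEIGHTS):
    """Mean severity weight over all responses (partial leaks count fractionally)."""
    return sum(weights[label] for label in labels) / len(labels)


# Made-up judge labels for ten test cases, for illustration only.
labels = [
    "refusal", "partial", "partial", "full", "refusal",
    "partial", "refusal", "partial", "full", "partial",
]

print(f"Binary ASR:   {binary_asr(labels):.1%}")
print(f"Weighted ASR: {weighted_asr(labels):.1%}")
```

In this toy sample, only two of ten responses are full compliance, so the binary rate is low, while the five partial leaks raise the weighted rate. The same mechanism would explain the gap between the reported 22.6\% binary ASR and 52.7\% WASR, although the actual weighting scheme may differ from the one assumed here.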
