OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and the Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes, including quant fever, reasoning blackholes, Schrödinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B, leading to severe consequences.