OpenAI's GPT-OSS family provides open-weight language models with explicit chain-of-thought (CoT) reasoning and a Harmony prompt format. We summarize an extensive security evaluation of GPT-OSS-20B that probes the model's behavior under different adversarial conditions. Using the Jailbreak Oracle (JO) [1], a systematic LLM evaluation tool, the study uncovers several failure modes, including quant fever, reasoning blackholes, Schrödinger's compliance, reasoning procedure mirage, and chain-oriented prompting. Experiments demonstrate how these behaviors can be exploited on GPT-OSS-20B models, leading to severe consequences.