Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, framing them as adversarial events and relying on defensive refusals. In real-world settings, however, risk also arises from non-malicious users seeking help while under psychological distress (e.g., with self-harm intentions). In such cases, the model's response can strongly influence the user's next actions: a simple refusal may lead them to repeat the request, escalate, or migrate to unsafe platforms, producing worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful outcomes. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining strong general capabilities. On our Constructive Benchmark it shows strong constructive engagement, close to GPT-5, and it attains unmatched robustness among open models on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.