Large language models (LLMs) typically deploy safety mechanisms to prevent harmful content generation. Most current approaches focus narrowly on risks posed by malicious actors, framing them as adversarial events and relying on defensive refusals. In real-world settings, however, risk also arises from non-malicious users seeking help while under psychological distress (e.g., with self-harm intentions). In such cases, the model's response can strongly influence the user's next actions: a simple refusal may lead them to repeat the request, escalate, or migrate to unsafe platforms, producing worse outcomes. We introduce Constructive Safety Alignment (CSA), a human-centric paradigm that protects against malicious misuse while actively guiding vulnerable users toward safe and helpful outcomes. Implemented in Oyster-I (Oy1), CSA combines game-theoretic anticipation of user reactions, fine-grained risk boundary discovery, and interpretable reasoning control, turning safety into a trust-building process. Oy1 achieves state-of-the-art safety among open models while retaining strong general capabilities. On our Constructive Benchmark it shows strong constructive engagement, close to GPT-5, and it attains unmatched robustness among open models on the Strata-Sword jailbreak dataset, nearing GPT-o1 levels. By shifting from refusal-first to guidance-first safety, CSA redefines the model-user relationship, aiming for systems that are not just safe but meaningfully helpful. We release Oy1, code, and the benchmark to support responsible, user-centered AI.