Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
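To make the link between weighted dominance and spectral risk concrete, the following is a minimal sketch in Python, not the RAD implementation. It assumes scalar costs and equal-sized samples from the target and reference policies, so that sorted samples serve as empirical quantiles. It swaps in this closed-form one-dimensional comparison for the entropic OT objective with Sinkhorn iterations described above. The function names (`weighted_fsd_violation`, `spectral_risk`, `cvar_weights`) and the choice of CVaR as the example spectral risk measure are illustrative assumptions.

```python
import numpy as np

def weighted_fsd_violation(policy_costs, ref_costs, weights):
    """Weighted amount by which policy cost quantiles exceed reference quantiles.

    Assumes both samples have the same size, so the i-th sorted value is the
    empirical quantile at the same level for both. A value of zero means the
    policy's empirical cost quantiles never exceed the reference's wherever
    the weights are positive.
    """
    q_policy = np.sort(np.asarray(policy_costs, dtype=float))
    q_ref = np.sort(np.asarray(ref_costs, dtype=float))
    excess = np.maximum(q_policy - q_ref, 0.0)  # hinge on quantile gaps
    return float(np.sum(weights * excess))

def spectral_risk(costs, weights):
    """Empirical spectral risk: weighted sum of sorted costs (quantiles)."""
    return float(np.sum(weights * np.sort(np.asarray(costs, dtype=float))))

def cvar_weights(n, alpha):
    """Illustrative spectrum: CVaR at level alpha puts uniform weight on the
    worst (1 - alpha) fraction of quantiles and zero elsewhere."""
    k = max(1, int(round((1.0 - alpha) * n)))
    w = np.zeros(n)
    w[-k:] = 1.0 / k
    return w

# Synthetic heavy-tailed costs (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
ref = rng.lognormal(mean=0.0, sigma=1.0, size=1000)
policy = rng.lognormal(mean=0.0, sigma=0.7, size=1000)

w = cvar_weights(len(ref), alpha=0.9)
violation = weighted_fsd_violation(policy, ref, w)
risk_gap = spectral_risk(policy, w) - spectral_risk(ref, w)

# Each quantile gap is at most its positive part, so the spectral risk gap
# can never exceed the weighted violation when the same weights are used.
print(f"weighted FSD violation: {violation:.4f}")
print(f"spectral risk gap (policy - reference): {risk_gap:.4f}")
```

In this empirical setting, driving the weighted violation to zero guarantees that the policy's spectral risk under the same weights is no larger than the reference's. Changing the weights shifts which part of the cost distribution the constraint emphasizes.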
