The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30\% to 50\% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_p S(m,p)-\min_p S(m,p)$, where $S(m,p)$ is the sycophancy rate of model $m$ under persona condition $p$; it is the range of sycophancy rates a model produces across persona conditions, and it treats sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15\% control rate) and at least one lightly-aligned model with $\Delta_{\text{floor}}=45$pp (5\%--50\% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and, counterintuitively, Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model. It is also the only persona that instructs resistance against user claims rather than engagement with them, which suggests a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
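
As a minimal sketch of how the metric defined above could be computed, the snippet below takes per-persona sycophancy rates for one model and returns their range. The function name `alignment_floor` and the dictionary layout are assumptions made for illustration, not the authors' code. The three rates are only the ones stated in the abstract for the lightly-aligned model (control 30\%, "be enthusiastic" 50\%, and Skeptic at 25pp below control); the other four persona conditions are omitted because their rates are not given here.

```python
from typing import Mapping


def alignment_floor(rates_pct: Mapping[str, int]) -> int:
    """Return max_p S(m, p) - min_p S(m, p) in percentage points.

    `rates_pct` maps a persona condition name to the sycophancy rate
    (in percent) measured for a single model under that persona.
    The name and signature are hypothetical, chosen for this sketch.
    """
    if not rates_pct:
        raise ValueError("need at least one persona condition")
    values = list(rates_pct.values())
    return max(values) - min(values)


# Rates stated in the abstract for the lightly-aligned model only.
# Skeptic is derived as control (30) minus the reported 25pp reduction.
lightly_aligned = {
    "control": 30,
    "enthusiastic": 50,
    "skeptic": 30 - 25,
}

floor_pp = alignment_floor(lightly_aligned)
print(f"alignment floor: {floor_pp}pp")  # prints "alignment floor: 45pp"
```

Because the metric is a range over conditions, it depends on which personas are in the panel; adding a persona can only keep the value the same or increase it.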
