A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences -- no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases -- a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, "Under what conditions does bias appear?" rather than "Is this model biased?" We release our benchmark, code, and results.
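The statistical machinery behind the paired contrasts can be illustrated with a minimal sketch. This is not the released benchmark code; the function names, the percentile-bootstrap CI, and the Benjamini-Hochberg FDR procedure are assumptions about how "paired contrasts with bootstrap CIs and FDR correction" might be computed from per-item differences in stereotype selection between two context framings (e.g., 1990 vs. 2030).

```python
import random
import statistics

def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference.

    `diffs` holds per-item differences in stereotype selection
    between two contexts (e.g., 1990 minus 2030 framing).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), (lo, hi)

def bh_fdr(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns one reject/accept flag per p-value, controlling the
    false discovery rate at level `q` across all contrasts tested.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank k with p_(k) <= q * k / m; reject all p-values up to it.
    threshold_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            threshold_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= threshold_rank:
            reject[i] = True
    return reject
```

One contrast (say, temporal anchoring) would yield a mean shift with its CI from `bootstrap_ci`; running many contrasts per model produces a vector of p-values that `bh_fdr` filters before the shifts are reported as significant.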