Large Language Model (LLM) safety is inherently pluralistic, reflecting variation in moral norms, cultural expectations, and demographic contexts. Yet existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking how safety perception varies across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral-7B-Instruct-v0.3, retaining demographic metadata; low-resource domains are expanded via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs as raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (δ = 0.5, τ = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
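The abstract does not give implementation details for the SimHash-based deduplication used when expanding low-resource domains. A minimal word-level sketch is shown below; the 64-bit fingerprint size and the Hamming-distance threshold of 3 are illustrative assumptions, not values from the paper:

```python
import hashlib

def simhash(text, bits=64):
    """Compute a SimHash fingerprint from word-level features."""
    v = [0] * bits
    for token in text.lower().split():
        # Stable per-token hash, truncated to `bits` bits.
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) % (1 << bits)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is 1 iff the weighted vote is positive.
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def dedup(samples, threshold=3):
    """Keep a sample only if its fingerprint is more than `threshold`
    bits away from every fingerprint kept so far."""
    kept, seen = [], []
    for s in samples:
        fp = simhash(s)
        if all(hamming(fp, f) > threshold for f in seen):
            kept.append(s)
            seen.append(fp)
    return kept
```

Near-duplicate prompts produce fingerprints within a few bits of each other, so a single Hamming-distance check filters paraphrase-level repeats that exact-match deduplication would miss.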
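The abstract reports inter-rater reliability as an ICC without specifying the variant. Assuming a standard two-way random-effects, single-rater intraclass correlation, ICC(2,1), over an n_subjects × k_raters score matrix, a minimal sketch is:

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: array of shape (n_subjects, k_raters).
    """
    n, k = ratings.shape
    grand = ratings.mean()
    mean_rows = ratings.mean(axis=1)   # per-subject means
    mean_cols = ratings.mean(axis=0)   # per-rater means

    # Two-way ANOVA sums of squares.
    ss_rows = k * ((mean_rows - grand) ** 2).sum()
    ss_cols = n * ((mean_cols - grand) ** 2).sum()
    ss_total = ((ratings - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

With three raters scoring each prompt (as with the three LLM raters here), perfect agreement yields an ICC of 1, and the reported 0.87 indicates the raters' scores track the same per-prompt ordering with little rater-specific bias.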