Can Large Language Models Make Everyone Happy?

Misalignment in Large Language Models (LLMs) refers to the failure to simultaneously satisfy safety, value, and cultural dimensions, leading to behaviors that diverge from human expectations in real-world settings where these dimensions must co-occur. Existing benchmarks, such as SAFETUNEBED (safety-centric), VALUEBENCH (value-centric), and WORLDVIEW-BENCH (culture-centric), primarily evaluate these dimensions in isolation and therefore provide limited insight into their interactions and trade-offs. More recent efforts, including MIB and INTERPRETABILITY BENCHMARK, both based on mechanistic interpretability, offer valuable perspectives on model failures; however, they remain insufficient for systematically characterizing cross-dimensional trade-offs. To address these gaps, we introduce MisAlign-Profile, a unified benchmark for measuring misalignment trade-offs inspired by mechanistic profiling. First, we construct MISALIGNTRADE, an English misaligned-aligned dataset spanning a taxonomy of 112 normative domains: 14 safety, 56 value, and 42 cultural domains. In addition to its domain label, each prompt is classified into one of three orthogonal semantic types (object, attribute, or relation misalignment) using Gemma-2-9B-it, and the prompt set is expanded via Qwen3-30B-A3B-Instruct-2507, with SimHash-based fingerprinting used for deduplication. Each prompt is paired with a misaligned and an aligned response through two-stage rejection sampling to ensure quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs on MISALIGNTRADE, revealing misalignment trade-offs of 12%-34% across dimensions.
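
As an illustration of the SimHash-based deduplication step mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the tokenization, hash function, fingerprint width, or distance threshold, so the word 3-gram shingles, MD5-derived 64-bit hashes, and Hamming threshold of 3 used here are all assumptions, and the function names `simhash`, `hamming`, and `deduplicate` are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint built from word 3-gram shingles (assumed scheme)."""
    tokens = re.findall(r"\w+", text.lower())
    # Short texts still yield one shingle so the fingerprint is always defined.
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(prompts: list[str], max_distance: int = 3) -> list[str]:
    """Keep a prompt only if its fingerprint is far from every prompt kept so far."""
    kept: list[str] = []
    fingerprints: list[int] = []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept


if __name__ == "__main__":
    # Hypothetical prompts; the second differs from the first only in case and punctuation.
    generated = [
        "Explain why tipping customs differ between countries.",
        "explain why tipping customs differ between countries",
        "Describe how a community should balance privacy against public safety.",
    ]
    for prompt in deduplicate(generated):
        print(prompt)
```

The pairwise comparison is quadratic in the number of prompts, which is acceptable for a sketch; at larger scale, fingerprints are commonly bucketed by bit bands so that only candidates sharing a band are compared.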
