Current safety alignment for Large Language Models (LLMs) implicitly optimizes for a "modal adult user," leaving models vulnerable to distributional shift in user cognition. We present ChildSafe, a benchmark that quantifies alignment robustness under cognitive shifts corresponding to four developmental stages. In contrast to static persona-based evaluations, we introduce a parametric cognitive simulation approach that formalizes each developmental stage as a set of hyperparameter constraints (e.g., volatility, context horizon) and uses them to generate out-of-distribution interaction traces. We validate the resulting simulated agents against ground-truth human linguistic data (CHILDES) and deploy them across 1,200 multi-turn interactions. Our results reveal a systematic alignment generalization gap: state-of-the-art models exhibit up to 11.5% performance degradation when interacting with early-childhood agents relative to standard adult baselines. We release the validated agent artifacts and evaluation protocols to the research community to facilitate robust alignment testing against non-adversarial, cognitively diverse populations.
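To make the parametric framing concrete, the sketch below shows one way developmental stages could be encoded as hyperparameter constraints on a simulated user agent. This is a minimal illustration under assumed names and values: `StageParams`, `STAGES`, `build_agent_turn`, the stage labels, and all numeric settings are hypothetical, not the paper's actual parameterization.

```python
from dataclasses import dataclass
import random

# Hypothetical illustration: developmental stages as hyperparameter
# constraints on a simulated child-user agent. Names and values are
# assumptions for exposition, not the benchmark's real configuration.

@dataclass(frozen=True)
class StageParams:
    volatility: float      # probability of an abrupt topic shift per turn
    context_horizon: int   # number of prior turns the agent can attend to

# Four stages, loosely following the abstract's four-stage framing.
STAGES = {
    "early_childhood":   StageParams(volatility=0.6,  context_horizon=2),
    "middle_childhood":  StageParams(volatility=0.4,  context_horizon=4),
    "early_adolescence": StageParams(volatility=0.25, context_horizon=8),
    "late_adolescence":  StageParams(volatility=0.1,  context_horizon=16),
}

def build_agent_turn(history: list[str], params: StageParams,
                     rng: random.Random) -> str:
    """Generate the next simulated-user turn under stage constraints.

    The agent only "sees" the last `context_horizon` turns, and with
    probability `volatility` it abandons the current thread entirely.
    """
    visible = history[-params.context_horizon:]
    if rng.random() < params.volatility:
        return "NEW_TOPIC"  # placeholder for an off-thread utterance
    return f"FOLLOW_UP on: {visible[-1] if visible else 'greeting'}"

if __name__ == "__main__":
    rng = random.Random(0)
    history = ["hello", "tell me about space", "why is the sky blue"]
    print(build_agent_turn(history, STAGES["early_childhood"], rng))
```

Under this framing, tightening `context_horizon` and raising `volatility` pushes interaction traces further from the adult-user distribution the target model was aligned on, which is the out-of-distribution shift the benchmark measures.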