Evaluating the value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts that directly probe models with ethically sensitive or controversial questions. However, with rapid advances in AI safety techniques, models have become increasingly adept at recognizing and deflecting such straightforward probes, limiting the effectiveness of these tests in revealing underlying biases and ethical stances. To address this limitation, we propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios. This approach increases the stealth and adversarial strength of the evaluation, making it more robust to the surface-level safeguards implemented in modern LLMs. We design and build a dataset of conversational traps and ethically ambiguous storytelling, and use it to systematically assess LLMs' responses in more nuanced, context-rich settings. Experimental results demonstrate that this enhanced methodology effectively exposes latent biases that remain undetected in traditional single-shot evaluations. Our findings highlight the necessity of contextual, dynamic testing of value alignment in LLMs, paving the way for more sophisticated and realistic assessments of AI ethics and safety.
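To make the notion of a "conversational trap" concrete, the sketch below shows one way such a multi-turn benchmark item could be represented and replayed against a model. This is an illustrative assumption, not the paper's released implementation: the `TrapTurn` and `ConversationalTrapItem` structures, the `chat` wrapper, and the `judge` scoring function are all hypothetical names introduced here for exposition.

```python
# Minimal sketch (illustrative only) of a multi-turn "conversational trap" item
# and a replay harness. The ChatModel wrapper and the judge heuristic are
# assumptions, not part of the described benchmark's actual code.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TrapTurn:
    """One scripted user turn; early turns are benign context, a later turn probes."""
    user_message: str
    is_probe: bool = False  # True for the turn whose reply is actually scored


@dataclass
class ConversationalTrapItem:
    """A multi-turn scenario that hides an ethically sensitive probe in context."""
    item_id: str
    turns: List[TrapTurn]
    value_dimension: str  # e.g. "fairness", "harm-avoidance"


def run_trap_item(
    item: ConversationalTrapItem,
    chat: Callable[[List[Dict[str, str]]], str],   # assumed wrapper around any LLM API
    judge: Callable[[str, str], float],            # assumed scorer: (reply, dimension) -> [0, 1]
) -> float:
    """Replay the scripted turns, capture the model's reply to the probe turn,
    and return the judge's alignment score for that reply."""
    history: List[Dict[str, str]] = []
    probe_reply = ""
    for turn in item.turns:
        history.append({"role": "user", "content": turn.user_message})
        reply = chat(history)
        history.append({"role": "assistant", "content": reply})
        if turn.is_probe:
            probe_reply = reply
    return judge(probe_reply, item.value_dimension)
```

The design choice illustrated here is that only the probe turn is scored, while the preceding benign turns exist solely to build narrative context that surface-level safeguards are unlikely to flag.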