This paper examines risk assessment in Large Language Models (LLMs) as they are deployed in a growing range of applications. We focus on how reward models, which are used to fine-tune pretrained LLMs to align with human values, perceive and categorize different types of risk, and on the challenges posed by the subjective nature of preference-based training data. Using the Anthropic Red-team dataset, we analyze major risk categories, including Information Hazards, Malicious Uses, and Discrimination/Hateful content. Our findings indicate that LLMs tend to consider Information Hazards less harmful, a result confirmed by a specially developed regression model. Our analysis also shows that LLMs respond less stringently to Information Hazards than to other risks. Finally, the study reveals that LLMs are significantly vulnerable to jailbreaking attacks in Information Hazard scenarios, which points to a critical security concern in LLM risk assessment and to the need for improved AI safety measures.