Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users

Safety evaluations of large language models (LLMs) typically focus on universal risks such as dangerous capabilities or undesirable propensities. However, millions of people use LLMs for personal advice on high-stakes topics such as finance and health, where harms are context-dependent rather than universal. While frameworks like the OECD's AI classification recognize the need to assess individual risks, user-welfare safety evaluations remain underdeveloped. We argue that developing such evaluations is non-trivial because of fundamental open questions about how to account for user context in evaluation design. In this exploratory study, we evaluated advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability. First, we demonstrate that evaluators must have access to rich user context: identical LLM responses were rated significantly safer by context-blind evaluators than by those aware of user circumstances, with safety scores for high-vulnerability users dropping from safe (5/7) to somewhat unsafe (3/7). One might assume this gap could be closed by creating realistic user prompts that contain key contextual information. However, our second study challenges this assumption: we reran the evaluation on prompts containing the context that users report they would disclose, and found no significant improvement. Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles, because realistic user context disclosure alone proves insufficient, particularly for vulnerable populations. By demonstrating a methodology for context-aware evaluation, this study provides both a starting point for such assessments and foundational evidence that evaluating individual welfare demands approaches distinct from existing universal-risk frameworks. We publish our code and dataset to support future work.
