Language Shapes Mental Health Evaluations in Large Language Models

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.
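The decision-level comparison described above can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' evaluation code: the function names, the severity label names, and the ordinal encoding are all hypothetical, since the abstract specifies only a binary stigma detection task and a four-class severity task. The sketch computes sensitivity (recall on stigmatizing items) and the rate of under-estimation errors, then takes the Chinese-minus-English difference on the same gold-labeled items.

```python
from typing import Callable, Sequence

# Assumed ordinal encoding of the four severity classes. The label names are
# hypothetical; the abstract only states that the task has four classes.
SEVERITY_ORDER = {"minimal": 0, "mild": 1, "moderate": 2, "severe": 3}


def sensitivity(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Recall on the positive (stigmatizing = 1) class: TP / (TP + FN)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    positives = [p for g, p in zip(gold, pred) if g == 1]
    if not positives:
        raise ValueError("no positive gold labels to compute sensitivity on")
    return sum(1 for p in positives if p == 1) / len(positives)


def underestimation_rate(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items whose predicted severity is strictly below gold."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal in length")
    under = sum(
        1 for g, p in zip(gold, pred) if SEVERITY_ORDER[p] < SEVERITY_ORDER[g]
    )
    return under / len(gold)


def language_gap(
    metric: Callable[[Sequence, Sequence], float],
    gold: Sequence,
    pred_en: Sequence,
    pred_zh: Sequence,
) -> float:
    """metric(Chinese predictions) minus metric(English predictions).

    Both prediction lists must refer to the same gold items, i.e. the
    semantically equivalent inputs presented in each language.
    """
    return metric(gold, pred_zh) - metric(gold, pred_en)
```

Under this sketch, a negative `language_gap(sensitivity, ...)` would correspond to reduced sensitivity under Chinese prompts, and a positive `language_gap(underestimation_rate, ...)` to more under-estimation errors, which are the directions the abstract reports.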

