Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.