Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-first case study provides the first direct comparison of state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0 for human experts), possibly reflecting a reluctance to apply the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings suggest that while LLMs may be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.