Background: Advances in large language models (LLMs) have opened new possibilities for psychiatric interviews, an underexplored area where LLMs could be valuable. This study focuses on enhancing psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced trauma and mental health issues.

Objective: The study investigates whether LLMs can (1) identify parts of conversations that suggest psychiatric symptoms and recognize those symptoms, and (2) summarize stressors and symptoms from interview transcripts.

Methods: LLMs were tasked with (1) extracting stressors from transcripts, (2) identifying symptoms and the transcript sections in which they appear, and (3) generating interview summaries from the extracted information. The transcripts were labeled by mental health experts for training and evaluation.

Results: In the zero-shot inference setting with GPT-4 Turbo, 73 of 102 segments were recalled with a mid-token distance d < 20 when identifying symptom-related sections. For recognizing specific symptoms, fine-tuning outperformed zero-shot inference, achieving 0.82 in accuracy, precision, recall, and F1-score. For the generative summarization task, LLMs given symptom and stressor information scored highly on G-Eval metrics: coherence (4.66), consistency (4.73), fluency (2.16), and relevance (4.67). Retrieval-augmented generation showed no notable performance improvement.

Conclusions: With fine-tuning or appropriate prompting, LLMs demonstrated strong accuracy (over 0.8) for symptom delineation and high coherence (4.6+) in summarization. These findings highlight their potential to assist mental health practitioners in analyzing psychiatric interviews.
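The Results report section-identification performance via a mid-token distance threshold (d < 20). The abstract does not define the metric, so the sketch below assumes one plausible reading: d is the absolute difference between the token-index midpoints of a predicted span and its reference span, with a prediction counted as recalled when d falls under the threshold. The function name and span convention are illustrative, not from the paper.

```python
def mid_token_distance(pred_span, gold_span):
    """Absolute distance between the token-index midpoints of two spans.

    Spans are (start, end) token indices. This is an assumed reading of
    the paper's mid-token distance d; the exact definition may differ.
    """
    pred_mid = (pred_span[0] + pred_span[1]) / 2
    gold_mid = (gold_span[0] + gold_span[1]) / 2
    return abs(pred_mid - gold_mid)


def is_recalled(pred_span, gold_span, threshold=20):
    # Under this reading, a predicted symptom-related section counts as
    # recalled when its midpoint lies within `threshold` tokens of the
    # labeled section's midpoint (d < 20 in the reported results).
    return mid_token_distance(pred_span, gold_span) < threshold


print(mid_token_distance((100, 140), (110, 152)))  # 11.0
print(is_recalled((100, 140), (110, 152)))         # True
```

Under this interpretation, the reported figure would mean 73 of 102 labeled segments had a model-predicted span whose midpoint fell within 20 tokens of the expert-labeled midpoint.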