Large Language Models for Patient Comments Multi-Label Classification

Patient experience and care quality are crucial for a hospital's sustainability and reputation, and the analysis of patient feedback offers valuable insight into patient satisfaction and outcomes. However, the unstructured nature of these comments poses challenges for traditional supervised machine learning methods, owing to the scarcity of labeled data and the nuances these texts contain. This research explores the use of Large Language Models (LLMs) for Multi-label Text Classification (MLTC) of inpatient comments shared after a hospital stay. GPT-4 Turbo was used to perform the classification. Given the sensitive nature of patients' comments, a security layer is applied before the data are passed to the LLM: a Protected Health Information (PHI) detection framework that de-identifies patients. Within a prompt engineering framework, zero-shot learning, in-context learning, and chain-of-thought prompting were evaluated. Results demonstrate that GPT-4 Turbo, in both zero-shot and few-shot settings, outperforms traditional methods and Pre-trained Language Models (PLMs). The zero-shot setting achieves the highest overall performance, with an F1-score of 76.12% and a weighted F1-score of 73.61%, followed closely by few-shot learning. Subsequently, the association between the classification results and other structured patient experience variables (e.g., rating) was examined. The study advances MLTC through the application of LLMs, offering healthcare practitioners an efficient method to gain deeper insights into patient feedback and deliver prompt, appropriate responses.
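To make the two reported metrics concrete, the following is a minimal sketch of how an F1-score and a support-weighted F1-score can be computed for multi-label predictions. The abstract does not say how the unweighted F1-score is averaged, so micro-averaging is an assumption here. The function names and the category labels ("staff", "food", "cleanliness") are hypothetical and not taken from the study.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """Return (micro F1, support-weighted F1).

    y_true and y_pred are lists of label sets, one set per comment.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        for label in pred & gold:
            tp[label] += 1  # predicted and correct
        for label in pred - gold:
            fp[label] += 1  # predicted but not in the gold set
        for label in gold - pred:
            fn[label] += 1  # in the gold set but missed

    labels = set(tp) | set(fp) | set(fn)

    # Micro F1 pools the counts over all labels (assumed averaging scheme).
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

    # Weighted F1 averages per-label F1, weighted by gold-label frequency.
    support = {label: tp[label] + fn[label] for label in labels}
    total = sum(support.values())
    weighted = (
        sum(f1(tp[l], fp[l], fn[l]) * support[l] for l in labels) / total
        if total
        else 0.0
    )
    return micro, weighted


# Hypothetical labels for three comments, for illustration only.
y_true = [{"staff", "food"}, {"staff"}, {"cleanliness"}]
y_pred = [{"staff"}, {"staff", "food"}, {"cleanliness", "food"}]
micro, weighted = multilabel_f1(y_true, y_pred)
print(f"micro F1 = {micro:.4f}, weighted F1 = {weighted:.4f}")
```

The two numbers can diverge, as they do in this toy example, because the weighted variant scores each category separately before averaging, while the micro variant pools all label decisions together.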
