Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback is challenging because data are limited and the language is domain-specific. Traditional supervised approaches require extensive labeled datasets, making unsupervised methods more practical for extracting insights. This study applies unsupervised techniques to analyze 439 survey responses from a healthcare system in Wisconsin, USA. A keyword-based filter built on a domain-specific lexicon was used to isolate complaint-related feedback. To identify dominant themes, we evaluated two traditional topic models, Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, a neural embedding-based clustering method. To improve coherence and interpretability in sparse, short-text data, we propose kBERT, which integrates BERT embeddings with k-means clustering. Model performance was assessed using coherence scores (Cv) and average Inverted Rank-Biased Overlap (IRBOavg). kBERT achieved the highest coherence (Cv = 0.53) and topic separation (IRBOavg = 1.00), outperforming all other models. These findings highlight the value of embedding-based, context-aware models in healthcare analytics.
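To make the topic-separation metric concrete, the following is a minimal sketch of an inverted rank-biased overlap computation over the top words of each topic. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names `rbo_truncated` and `irbo`, the weight `p = 0.9`, the normalized truncated form of RBO, and the example word lists are all hypothetical choices. Published implementations often use an extrapolated RBO variant, so intermediate scores may differ from those produced here.

```python
from itertools import combinations


def rbo_truncated(a, b, p=0.9):
    """Normalized, truncated rank-biased overlap of two ranked word lists.

    Assumption: this is a simplified variant, scaled so that identical
    lists score 1 and disjoint lists score 0.
    """
    depth = min(len(a), len(b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    score, norm = 0.0, 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        weight = p ** (d - 1)  # top-ranked words count the most
        score += weight * len(seen_a & seen_b) / d  # agreement at depth d
        norm += weight
    return score / norm


def irbo(topics, p=0.9):
    """One minus the mean pairwise RBO over all pairs of topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        raise ValueError("need at least two topics")
    mean_rbo = sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)
    return 1.0 - mean_rbo


# Hypothetical top-word lists, not taken from the study's data.
disjoint_topics = [
    ["wait", "time", "hour"],
    ["staff", "rude", "nurse"],
    ["bill", "cost", "insurance"],
]
overlapping_topics = [
    ["wait", "time", "hour"],
    ["wait", "staff", "rude"],
]

print(irbo(disjoint_topics))     # 1.0
print(irbo(overlapping_topics))  # strictly between 0 and 1
```

Under this definition, a score of 1.00 means that no two topics share any of their top-ranked words, while lower values indicate overlapping topic vocabularies.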