Contextual Embedding-based Clustering to Identify Topics for Healthcare Service Improvement

Understanding patient feedback is crucial for improving healthcare services, yet analyzing unlabeled short-text feedback is difficult because data are limited and the language is domain-specific. Supervised learning approaches require extensive labeled datasets, which makes unsupervised methods more viable for uncovering insights from patient feedback. This study explores unsupervised methods for extracting topics from 439 survey responses collected from a healthcare system in Wisconsin, USA. A keyword-based filtering approach with a domain-specific lexicon was applied to isolate complaint-related feedback. To analyze the dominant topics in this feedback, we evaluated traditional topic modeling methods, namely Latent Dirichlet Allocation (LDA) and Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM), alongside BERTopic, a neural embedding-based clustering approach. To improve coherence and interpretability when data are scarce and consist of short texts, we propose kBERT, which combines BERT embeddings with k-means clustering. Model performance was assessed using the coherence score (Cv) for topic interpretability and the average Inverted Rank-Biased Overlap (IRBOavg) for topic diversity. Results indicate that kBERT achieves the highest coherence (Cv = 0.53) and the most distinct topic separation (IRBOavg = 1.00), outperforming all other models in short-text healthcare feedback analysis. Our findings emphasize the importance of embedding-based techniques for topic identification and highlight the need for context-aware models in healthcare analytics.
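To make the diversity metric concrete, the following is a minimal sketch of how an average inverted rank-biased overlap could be computed over the top words of each topic. It is not the implementation used in this study. It assumes a simplified rank-biased overlap that is truncated at the list depth and normalized so that identical lists score 1, which may differ from the extrapolated variant used in some topic modeling toolkits. The function names `rbo_truncated` and `irbo_avg`, the persistence value `p = 0.9`, and the example topic words are all hypothetical.

```python
from itertools import combinations


def rbo_truncated(a, b, p=0.9):
    """Simplified rank-biased overlap of two ranked word lists.

    Assumption: truncated at the shorter list's depth and normalized so
    that two identical lists score 1.0. Words within a list are assumed
    to be unique.
    """
    depth = min(len(a), len(b))
    if depth == 0:
        return 0.0
    seen_a, seen_b = set(), set()
    weighted_agreement = 0.0
    max_score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        weight = p ** (d - 1)  # higher ranks receive more weight
        # Fraction of words shared by the two top-d prefixes.
        weighted_agreement += weight * len(seen_a & seen_b) / d
        max_score += weight
    return weighted_agreement / max_score


def irbo_avg(topics, p=0.9):
    """One minus the mean pairwise overlap across all topic pairs."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 1.0  # a single topic has nothing to overlap with
    mean_overlap = sum(rbo_truncated(a, b, p) for a, b in pairs) / len(pairs)
    return 1.0 - mean_overlap


# Hypothetical top words for three topics (illustrative, not study output).
example_topics = [
    ["wait", "time", "appointment"],
    ["billing", "insurance", "cost"],
    ["nurse", "staff", "communication"],
]
print(irbo_avg(example_topics))  # 1.0, because no word is shared
```

Under this simplified definition, a score of 1.00 means that no two topics share any of their top words, while a score of 0 means that every topic has the same ranked word list.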
