Free text comments (FTC) in patient-reported outcome measures (PROMs) data are typically analysed using manual methods, such as content analysis, which are labour-intensive and time-consuming. Machine learning methods for analysing such text are largely unsupervised, so their output requires post-analysis interpretation. Weakly supervised text classification (WSTC) can be a valuable method for classifying domain-specific text for which labelled data is limited. In this paper, we apply five WSTC techniques to FTC in PROMs data to identify health-related quality of life (HRQoL) themes reported by colorectal cancer patients. The WSTC methods label all the themes mentioned in each comment. The results showed moderate performance on the PROMs data, limited mainly by the precision of the models, and performance varied between themes. Evaluation of the classification performance illustrated both the potential and the limitations of keyword-based WSTC for labelling PROMs FTC when labelled data is limited.
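To make the idea of keyword-based weak labelling concrete, the following is a minimal sketch, not one of the five techniques evaluated in the paper. It assumes a small hand-written seed lexicon per theme; the theme names, keywords, and function names (`SEED_KEYWORDS`, `label_comment`, `precision_recall`) are all hypothetical and chosen for illustration only. Each comment may receive several themes, and micro-averaged precision and recall are computed against a small set of gold labels.

```python
import re

# Hypothetical seed keywords per HRQoL theme (illustrative, not the paper's lexicon).
SEED_KEYWORDS = {
    "fatigue": ["tired", "fatigue", "exhausted"],
    "bowel_function": ["bowel", "stoma", "diarrhoea"],
    "emotional_wellbeing": ["anxious", "worried", "depressed"],
}


def label_comment(comment, seeds=SEED_KEYWORDS):
    """Return the set of themes whose seed keywords occur in the comment."""
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    return {theme for theme, words in seeds.items() if tokens & set(words)}


def precision_recall(predicted, gold):
    """Micro-averaged precision and recall over multi-label predictions.

    `predicted` and `gold` are parallel lists of sets of theme names.
    """
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    comments = [
        "I feel tired all the time and worried about my stoma.",
        "Not tired at all these days.",  # negation still triggers the keyword
    ]
    gold = [{"fatigue", "emotional_wellbeing", "bowel_function"}, set()]
    predicted = [label_comment(c) for c in comments]
    print(predicted)
    print(precision_recall(predicted, gold))
```

In this toy setup the second comment is labelled with a theme it does not actually report, because plain keyword matching ignores negation. This is one simple way in which precision, rather than recall, can become the limiting factor for keyword-driven labelling.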