Patient-generated text such as secure messages, surveys, and interviews contains rich expressions of the patient voice (PV), reflecting communicative behaviors and social determinants of health (SDoH). Traditional qualitative coding frameworks are labor-intensive and do not scale to the large volumes of patient-authored messages across health systems. Existing machine learning (ML) and natural language processing (NLP) approaches provide partial solutions but often treat patient-centered communication (PCC) and SDoH as separate tasks or rely on models poorly suited to patient-facing language. We introduce PVminer, a domain-adapted NLP framework for structuring the patient voice in secure patient-provider communication. PVminer formulates PV detection as a multi-label, multi-class prediction task, integrating patient-specific BERT encoders (PV-BERT-base and PV-BERT-large), unsupervised topic modeling for thematic augmentation (PV-Topic-BERT), and fine-tuned classifiers for Code-, Subcode-, and Combo-level labels. Topic representations are incorporated during both fine-tuning and inference to enrich semantic inputs. PVminer achieves strong performance across hierarchical tasks and outperforms biomedical and clinical pre-trained baselines, reaching F1 scores of 82.25% (Code), 80.14% (Subcode), and up to 77.87% (Combo). An ablation study further shows that author identity and topic-based augmentation each contribute meaningful gains. Pre-trained models, source code, and documentation will be publicly released, with annotated datasets available upon request for research use.
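The topic-augmented multi-label prediction step can be illustrated with a minimal sketch. This is an assumption about the general wiring, not the authors' released implementation: a text embedding (a stand-in for a PV-BERT [CLS] vector) is concatenated with topic-model proportions (a stand-in for the PV-Topic-BERT features), and a linear head with a per-label sigmoid yields independent probabilities, so a message can carry several Code-level labels at once. All names and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_labels(text_emb, topic_props, W, b, threshold=0.5):
    """Multi-label prediction over (hypothetical) Code-level labels.

    text_emb:    (d_text,)  encoder embedding (stand-in for PV-BERT [CLS])
    topic_props: (d_topic,) topic proportions for the message
    W, b:        linear-head parameters, W of shape (n_labels, d_text + d_topic)

    Returns the indices of labels whose probability meets `threshold`,
    plus the full probability vector.
    """
    features = np.concatenate([text_emb, topic_props])  # topic augmentation
    probs = sigmoid(W @ features + b)                   # independent sigmoids
    return np.flatnonzero(probs >= threshold), probs

# Toy dimensions and random parameters, purely for illustration.
d_text, d_topic, n_labels = 8, 4, 5
W = rng.normal(size=(n_labels, d_text + d_topic))
b = np.zeros(n_labels)
emb = rng.normal(size=d_text)
topics = rng.dirichlet(np.ones(d_topic))  # topic proportions sum to 1

labels, probs = predict_labels(emb, topics, W, b)
print("predicted labels:", labels)
```

In a real system the head would be trained jointly with the encoder under a binary cross-entropy loss per label; the sketch only shows why concatenating topic proportions enriches the classifier's input.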