Leveraging deep active learning to identify low-resource mobility functioning information in public clinical notes

Function is increasingly recognized as an important indicator of whole-person health, yet it receives little attention in clinical natural language processing research. We introduce the first public annotated dataset dedicated to the Mobility domain of the International Classification of Functioning, Disability and Health (ICF), with the aim of facilitating automatic extraction and analysis of functioning information from free-text clinical notes. We use the National NLP Clinical Challenges (n2c2) research dataset to construct a pool of candidate sentences through keyword expansion. Our active learning approach, which uses query-by-committee sampling weighted by density representativeness, selects informative sentences for human annotation. We train BERT and CRF models and use their predictions to guide the selection of new sentences in subsequent annotation iterations. The final dataset consists of 4,265 sentences with a total of 11,784 entities: 5,511 Action entities, 5,328 Mobility entities, 306 Assistance entities, and 639 Quantification entities. The inter-annotator agreement (IAA), averaged over all entity types, is 0.72 for exact matching and 0.91 for partial matching. We also train and evaluate common BERT models and state-of-the-art Nested NER models. The best F1 scores are 0.84 for Action, 0.70 for Mobility, 0.62 for Assistance, and 0.71 for Quantification. These empirical results show that NER models are promising for accurately extracting mobility functioning information from clinical text. The public availability of our annotated dataset will facilitate further research on comprehensively capturing functioning information in electronic health records (EHRs).
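As an illustration of the selection step, the following is a minimal sketch of query-by-committee sampling weighted by density. It is not the authors' implementation: the disagreement measure (vote entropy), the density measure (mean cosine similarity to the rest of the pool), the exponent `beta`, and all function names are assumptions made for this example. Each committee member (for instance a BERT model and a CRF model) contributes one predicted label sequence per candidate sentence, and sentences are ranked by disagreement multiplied by density.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Committee disagreement for one sentence.

    votes: one hashable prediction per committee member,
    e.g. a tuple of token labels. Returns 0 when all members agree.
    """
    counts = Counter(votes)
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def density(i, vectors):
    """Mean similarity of sentence i to every other sentence in the pool.

    Clamped at 0 so that a fractional beta never sees a negative base.
    """
    sims = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
    return max(0.0, sum(sims) / len(sims)) if sims else 0.0


def select_batch(committee_votes, vectors, k, beta=1.0):
    """Return indices of the k sentences with the highest
    disagreement * density**beta score (hypothetical scoring rule)."""
    scores = [
        vote_entropy(votes) * density(i, vectors) ** beta
        for i, votes in enumerate(committee_votes)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Toy pool of three sentences; each entry holds the label sequences
    # predicted by two committee members (labels are illustrative only).
    committee_votes = [
        [("O", "B-Action"), ("O", "B-Action")],    # members agree
        [("O", "B-Action"), ("O", "B-Mobility")],  # members disagree
        [("B-Action", "O"), ("O", "O")],           # members disagree
    ]
    # Toy sentence vectors, e.g. TF-IDF or embedding features.
    vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    print(select_batch(committee_votes, vectors, k=2))  # prints [1, 2]
```

Weighting by density favors sentences that are both contested by the committee and typical of the candidate pool, which reduces the chance of spending annotation effort on outliers.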
