The introduction of computerized medical records in hospitals has reduced burdensome tasks such as manual writing and information retrieval. However, the data contained in medical records remain largely underutilized, primarily because extracting data from unstructured textual medical records takes time and effort. Information Extraction, a subfield of Natural Language Processing, can help clinical practitioners overcome this limitation through automated text-mining pipelines. In this work, we created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformer-based model. Moreover, we collected and leveraged three external independent datasets to implement an effective multicenter model, with an overall F1-score of 84.77%, a Precision of 83.16%, and a Recall of 86.44%. The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) the value of a fine-tuning strategy that combines classical methods with a "low-resource" approach. This allowed us to establish methodological guidelines that pave the way for Natural Language Processing studies in less-resourced languages.
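As a quick sanity check on the reported metrics, the F1-score can be recomputed as the harmonic mean of Precision and Recall. The sketch below is illustrative only: the helper name `f1_score` is hypothetical and not taken from this work, and it assumes the overall figures are aggregated so that the harmonic-mean relation holds (as with micro-averaging).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, expressed in percent.
precision = 83.16
recall = 86.44

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 84.77%"
```

Under that assumption, the recomputed value agrees with the reported F1-score of 84.77% to two decimal places.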