The introduction of computerized medical records in hospitals has reduced burdensome tasks such as manual writing and information retrieval. However, the data contained in these records remain heavily underutilized, primarily because extracting information from unstructured clinical text takes time and effort. Information Extraction, a subfield of Natural Language Processing, can help clinical practitioners overcome this limitation through automated text-mining pipelines. In this work, we created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Large Language Model for this task. Moreover, we conducted several experiments with three external independent datasets to implement an effective multicenter model, with an overall F1-score of 84.77%, precision of 83.16%, and recall of 86.44%. The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) the value of a fine-tuning strategy that combines classical methods with a "few-shot" approach. These findings allowed us to establish methodological guidelines that pave the way for future implementations in this field and allow Italian hospitals to tap into important research opportunities.
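As a quick consistency check on the reported metrics, the sketch below recomputes the F1-score as the harmonic mean of the stated precision and recall. It assumes all three figures were derived from the same pooled entity counts (for example, micro-averaged); the abstract does not say how they were aggregated, and a macro-averaged F1 would not have to satisfy this identity. The function and variable names are illustrative and not taken from the authors' code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both precision and recall are 0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 83.16
reported_recall = 86.44

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2f}")  # prints 84.77
```

Under that assumption, the recomputed value agrees with the reported overall F1-score of 84.77%.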