Automatic extraction of medical information from clinical documents poses several challenges: the high cost of required clinical expertise, limited interpretability of model predictions, restricted computational resources, and privacy regulations. Recent advances in domain adaptation and prompting methods have shown promising results with minimal training data using lightweight masked language models, which are well suited to established interpretability methods. We are the first to present a systematic evaluation of these methods in a low-resource setting, performing multi-class section classification on German doctor's letters. We conduct extensive class-wise evaluations supported by Shapley values to validate the quality of our small training data set and to ensure the interpretability of model predictions. We demonstrate that a lightweight, domain-adapted pretrained model, prompted with just 20 shots, outperforms a traditional classification model by 30.5% accuracy. Our results serve as a process-oriented guideline for clinical information extraction projects working in low-resource settings.
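To make the Shapley-value analysis mentioned above concrete, the following is a minimal, self-contained sketch of exact Shapley-value computation for a toy value function. The value function `v` and its per-token contributions are hypothetical stand-ins for a classifier's score for one section label given a subset of visible input tokens; they are not taken from the paper, and real analyses would use a sampling-based approximation rather than full enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values for a value function v over n players (features).

    v maps a frozenset of player indices to a real-valued score.
    Enumerating all coalitions is exponential in n, so this is only
    feasible for small toy examples like the one below.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [p for p in range(n) if p != i]
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                S = frozenset(subset)
                # Weight of this coalition in the Shapley average.
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                # Marginal contribution of player i to coalition S.
                phi[i] += w * (v(S | {i}) - v(S))
    return phi

# Hypothetical value function: the classifier's score for a section label
# (e.g. "Diagnosis") when only the tokens in S are visible.
def v(S):
    base = 0.1
    contrib = {0: 0.5, 1: 0.3, 2: -0.2}   # made-up per-token effects
    score = base + sum(contrib[i] for i in S)
    if 0 in S and 1 in S:                  # interaction between tokens 0 and 1
        score += 0.1
    return score

phi = shapley_values(v, 3)
# The efficiency property guarantees sum(phi) == v(all tokens) - v(empty set);
# here the interaction bonus is split equally between tokens 0 and 1.
```

In a class-wise evaluation, attributions like `phi` reveal which tokens drive a section-label prediction, which is how such values help validate a small training set.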