Language models (LMs) such as BERT and GPT have revolutionized natural language processing (NLP). However, privacy-sensitive domains, particularly medicine, face challenges in training LMs because of limited data access and privacy constraints imposed by regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). Federated learning (FL) offers a decentralized solution that enables collaborative learning while preserving data privacy. In this study, we systematically evaluate FL in medicine across $2$ biomedical NLP tasks using $6$ LMs and $8$ corpora. Our results show that: 1) FL models consistently outperform LMs trained on an individual client's data and sometimes match the model trained on pooled data; 2) with a fixed total amount of data, LMs trained using FL with more clients perform worse, although pre-trained transformer-based models are more resilient; 3) LMs trained using FL perform nearly on par with the model trained on pooled data when clients' data are independent and identically distributed (IID), but show visible gaps with non-IID data. Our code is available at: https://github.com/PL97/FedNLP