A Systematic Evaluation of Federated Learning on Biomedical Natural Language Processing

Language models (LMs) like BERT and GPT have revolutionized natural language processing (NLP). However, privacy-sensitive domains, particularly the medical field, face challenges in training LMs due to limited data access and privacy constraints imposed by regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). Federated learning (FL) offers a decentralized solution that enables collaborative learning while preserving data privacy. In this study, we systematically evaluate FL in medicine across $2$ biomedical NLP tasks using $6$ LMs encompassing $8$ corpora. Our results show that: 1) FL models consistently outperform LMs trained on an individual client's data and sometimes match the model trained with pooled data; 2) with a fixed total amount of data, LMs trained using FL with more clients exhibit inferior performance, but pre-trained transformer-based models exhibit greater resilience; 3) LMs trained using FL perform nearly on par with the model trained with pooled data when clients' data are IID, while exhibiting visible gaps with non-IID data. Our code is available at: https://github.com/PL97/FedNLP
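To make the collaborative-learning step concrete, here is a minimal sketch of one server-side aggregation round. It assumes a FedAvg-style weighted average, which the abstract does not specify. The function name `fedavg`, the dictionary layout of the weights, and the two-client example values are all hypothetical and are not taken from the authors' repository.

```python
from typing import Dict, List

# Assumption: each client's model is a dict mapping a parameter name
# to a flat list of floats. Real LMs would use tensors instead.
Weights = Dict[str, List[float]]


def fedavg(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Average client parameters, weighting each client by its sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Hypothetical example: two clients holding 100 and 300 local samples.
clients = [{"layer": [1.0, 2.0]}, {"layer": [3.0, 6.0]}]
sizes = [100, 300]
global_weights = fedavg(clients, sizes)
print(global_weights)  # {'layer': [2.5, 5.0]}
```

Only model parameters cross the network in this scheme; the raw clinical text stays with each client. Weighting by sample count means a client with more data pulls the global model further toward its local solution, which is one reason non-IID client data can open a gap relative to pooled training.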

