Transformer-based pre-trained models have emerged as the predominant solution for natural language processing (NLP). Fine-tuning such pre-trained models for downstream tasks often requires a considerable amount of labeled private data. In practice, private data is often distributed across heterogeneous mobile devices and may be prohibited from being uploaded. Moreover, well-curated labeled data is often scarce, presenting an additional challenge. To address these challenges, we first introduce a data generator for federated few-shot learning tasks, which encompasses both the quantity and the skewness of scarce labeled data in a realistic setting. Subsequently, we propose AUG-FedPrompt, a prompt-based federated learning system that exploits abundant unlabeled data for data augmentation. Our experiments indicate that, with only a limited amount of labeled data, AUG-FedPrompt can perform on par with full-set fine-tuning. However, such competitive performance comes at a significant system cost.