Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning

We introduce a library, Dataset Grouper, to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library allows the creation of group-structured versions of existing datasets based on user-specified partitions, and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper allows for large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation.
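The idea of building a group-structured dataset from a user-specified partition can be illustrated with a small sketch. This is not Dataset Grouper's API: the names `partition_by_key` and `iterate_group`, the `domain` field, and the toy records are all assumptions made for illustration. The sketch also holds everything in memory, so it does not cover the case the abstract highlights, where a single group is too large to fit in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List

Example = Dict[str, str]


def partition_by_key(
    examples: Iterable[Example],
    get_key: Callable[[Example], Hashable],
) -> Dict[Hashable, List[Example]]:
    """Group examples by a user-specified key function (hypothetical helper)."""
    groups: Dict[Hashable, List[Example]] = defaultdict(list)
    for example in examples:
        groups[get_key(example)].append(example)
    return dict(groups)


def iterate_group(group: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield one group's examples in consecutive batches.

    The group here is an in-memory list; a scalable pipeline would
    stream these batches from disk instead.
    """
    for start in range(0, len(group), batch_size):
        yield group[start:start + batch_size]


# Toy base (non-partitioned) dataset; the fields are assumptions.
base_dataset = [
    {"domain": "a.example", "text": "first document"},
    {"domain": "b.example", "text": "second document"},
    {"domain": "a.example", "text": "third document"},
]

# The partition is defined by the user: here, one group per web domain.
groups = partition_by_key(base_dataset, lambda ex: ex["domain"])

for group_id, group in groups.items():
    for batch in iterate_group(group, batch_size=1):
        print(group_id, [ex["text"] for ex in batch])
```

Swapping the key function (for example, grouping by author instead of domain) gives a different partition of the same base dataset, which is the kind of flexibility the abstract describes.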

