Towards Federated Foundation Models: Scalable Dataset Pipelines for Group-Structured Learning

We introduce Dataset Grouper, a library to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library facilitates the creation of group-structured versions of existing datasets based on user-specified partitions and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper enables large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work, allowing for federated training of language models with hundreds of millions, and even billions, of parameters. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation. Dataset Grouper is available at https://github.com/google-research/dataset_grouper.
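To make the notion of a user-specified partition concrete, the following is a minimal sketch and not Dataset Grouper's actual API. It assumes a hypothetical toy base dataset (`BASE`) and a hypothetical key function (`by_domain`), and shows how one group's examples can be consumed lazily in batches rather than materialized in memory all at once.

```python
from itertools import islice
from typing import Callable, Hashable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_group(
    examples: Iterable[T],
    get_key: Callable[[T], Hashable],
    key: Hashable,
) -> Iterator[T]:
    """Lazily yield the examples whose partition key equals `key`."""
    return (ex for ex in examples if get_key(ex) == key)


def batched(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to `batch_size` items without reading the whole stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical toy base (non-partitioned) dataset, for illustration only.
BASE = [
    {"domain": "a.org", "text": "first"},
    {"domain": "b.org", "text": "second"},
    {"domain": "a.org", "text": "third"},
    {"domain": "a.org", "text": "fourth"},
]


def by_domain(example: dict) -> str:
    """Hypothetical user-specified partition: group examples by source domain."""
    return example["domain"]


# Stream the "a.org" group two examples at a time.
batches = list(batched(iter_group(BASE, by_domain, "a.org"), batch_size=2))
texts = [[ex["text"] for ex in batch] for batch in batches]
```

Filtering the full stream for each group is inefficient and is used here only to keep the sketch short; a pipeline at real scale would presumably group the data once and store each group separately.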

