We introduce Dataset Grouper, a library to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library facilitates the creation of group-structured versions of existing datasets based on user-specified partitions and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper enables large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work, allowing for federated training of language models with hundreds of millions, and even billions, of parameters. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation. Dataset Grouper is available at https://github.com/google-research/dataset_grouper.
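To make the idea of a user-specified partition concrete, the following is a minimal sketch in plain Python. It does not use Dataset Grouper's API. The names `partition_to_disk`, `iter_group`, and `get_group_id` are hypothetical, and the sketch assumes that examples are JSON-serializable and that group identifiers are safe to use as file names. It streams the base dataset once, writes each example to a per-group file, and reads a group back lazily, so that neither the base dataset nor any single group has to fit in memory.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, Iterable, Iterator, List

Example = Dict[str, Any]


def partition_to_disk(
    examples: Iterable[Example],
    get_group_id: Callable[[Example], str],
    out_dir: str,
) -> List[str]:
    """Stream examples once, appending each one to its group's JSONL file.

    Assumption: group ids are valid file names (illustrative only).
    """
    group_ids = set()
    for example in examples:
        group_id = get_group_id(example)
        group_ids.add(group_id)
        path = os.path.join(out_dir, f"{group_id}.jsonl")
        # Append mode keeps memory use independent of group size.
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(example) + "\n")
    return sorted(group_ids)


def iter_group(out_dir: str, group_id: str) -> Iterator[Example]:
    """Lazily yield one group's examples, one line at a time."""
    path = os.path.join(out_dir, f"{group_id}.jsonl")
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


if __name__ == "__main__":
    # Toy base dataset, partitioned by a hypothetical "domain" field.
    base = [
        {"domain": "a.example", "text": "one"},
        {"domain": "b.example", "text": "two"},
        {"domain": "a.example", "text": "three"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        ids = partition_to_disk(iter(base), lambda ex: ex["domain"], tmp)
        for gid in ids:
            print(gid, [ex["text"] for ex in iter_group(tmp, gid)])
```

Because the partition is just a function from an example to a group identifier, the same base dataset can be regrouped in a different way (for instance by author rather than by domain) by swapping `get_group_id`. Opening a file per example is simple but slow; a scalable implementation would batch the writes or use a distributed data-processing pipeline.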