We introduce a library, Dataset Grouper, to create large-scale group-structured (e.g., federated) datasets, enabling federated learning simulation at the scale of foundation models. This library allows the creation of group-structured versions of existing datasets based on user-specified partitions, and directly leads to a variety of useful heterogeneous datasets that can be plugged into existing software frameworks. Dataset Grouper offers three key advantages. First, it scales to settings where even a single group's dataset is too large to fit in memory. Second, it provides flexibility, both in choosing the base (non-partitioned) dataset and in defining partitions. Finally, it is framework-agnostic. We empirically demonstrate that Dataset Grouper allows for large-scale federated language modeling simulations on datasets that are orders of magnitude larger than in previous work. Our experimental results show that algorithms like FedAvg operate more as meta-learning methods than as empirical risk minimization methods at this scale, suggesting their utility in downstream personalization and task-specific adaptation.
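As a rough illustration of what "group-structured versions of existing datasets based on user-specified partitions" means, the sketch below partitions a flat list of examples with a user-supplied key function. It does not use Dataset Grouper's actual API: `partition_by_key`, `iter_groups`, and the `domain` field are hypothetical names chosen for this example. Everything here is held in memory, so the sketch does not reproduce the library's ability to handle groups too large to fit in memory.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Example = Dict[str, str]


def partition_by_key(
    examples: Iterable[Example],
    get_key: Callable[[Example], Hashable],
) -> Dict[Hashable, List[Example]]:
    """Assign each example to the group named by the user-specified key function."""
    groups: Dict[Hashable, List[Example]] = defaultdict(list)
    for example in examples:
        groups[get_key(example)].append(example)
    return dict(groups)


def iter_groups(
    groups: Dict[Hashable, List[Example]],
) -> Iterator[Tuple[Hashable, Iterator[Example]]]:
    """Yield (group_id, example_iterator) pairs, one per group (e.g., per client)."""
    for group_id, group_examples in groups.items():
        yield group_id, iter(group_examples)


# Hypothetical base (non-partitioned) dataset: text examples tagged with a source domain.
base_dataset = [
    {"domain": "a.example", "text": "first document"},
    {"domain": "b.example", "text": "second document"},
    {"domain": "a.example", "text": "third document"},
]

# The partition is defined entirely by the key function; here, one group per domain.
grouped = partition_by_key(base_dataset, lambda ex: ex["domain"])
group_sizes = {group_id: len(list(exs)) for group_id, exs in iter_groups(grouped)}
```

Swapping the key function (for example, grouping by author instead of domain) yields a different partition of the same base dataset, which is the kind of flexibility the abstract describes.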