Language model training relies on massive amounts of computation and on vast datasets scraped from potentially low-quality, copyrighted, or sensitive sources, a reliance that has been questioned on practical, legal, and ethical grounds. Federated learning provides a plausible alternative by enabling previously untapped data to be voluntarily contributed by collaborating organizations. However, when scaled globally, federated learning requires collaboration across heterogeneous legal, security, and privacy regimes while accounting for the inherent locality of language data; this further exacerbates the established challenge of statistical heterogeneity in federated learning. We propose a Worldwide Federated Language Model Training (WorldLM) system based on federations of federations, where each federation has the autonomy to account for factors such as its industry, operating jurisdiction, or competitive environment. WorldLM enables such autonomy in the presence of statistical heterogeneity via partial model localization, allowing sub-federations to attentively aggregate key layers from their constituents. Furthermore, it can adaptively share information across federations via residual layer embeddings. Evaluations of language modeling on naturally heterogeneous datasets show that WorldLM outperforms standard federations by up to $1.91\times$, approaches the personalized performance of fully local models, and maintains these advantages under privacy-enhancing techniques.
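To make the idea of partial model localization more concrete, the following is a minimal sketch of one aggregation step inside a single sub-federation. It is an illustration under stated assumptions, not the WorldLM implementation: the abstract does not specify how attentive aggregation is computed, so the sketch assumes softmax weights over dot-product similarity between the parent's copy of a key layer and each constituent's copy, and sample-count weighted averaging (as in standard federated averaging) for all other layers. The function names, the layer names `backbone` and `head`, and the flat-list parameter representation are all hypothetical.

```python
import math


def weighted_average(vectors, weights):
    """Weighted mean of equal-length parameter vectors."""
    total = sum(weights)
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(len(vectors[0]))
    ]


def attention_weights(query, keys):
    """Softmax over dot-product similarity between query and each key.

    Assumption: this scoring rule is illustrative only; the source does
    not state how attention scores are computed.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def aggregate_subfederation(parent, children, key_layers):
    """Aggregate constituent models into one sub-federation model.

    parent:     dict mapping layer name -> list of floats
    children:   list of (model, num_samples) pairs, same layer names
    key_layers: set of layer names aggregated attentively (hypothetical)
    """
    merged = {}
    for name in parent:
        child_layers = [model[name] for model, _ in children]
        if name in key_layers:
            # Key layers: weight constituents by similarity to the parent.
            weights = attention_weights(parent[name], child_layers)
        else:
            # Remaining layers: sample-count weighted averaging.
            weights = [num_samples for _, num_samples in children]
        merged[name] = weighted_average(child_layers, weights)
    return merged


# Toy example with two constituents and two layers.
parent = {"backbone": [0.0, 0.0], "head": [1.0, 0.0]}
children = [
    ({"backbone": [1.0, 1.0], "head": [1.0, 0.0]}, 1),
    ({"backbone": [3.0, 3.0], "head": [0.0, 1.0]}, 3),
]
merged = aggregate_subfederation(parent, children, key_layers={"head"})
```

In this toy run, the `backbone` layer is pulled toward the constituent with more samples, while the `head` layer is pulled toward the constituent whose head most resembles the parent's. A federation of federations would apply the same step recursively, with each sub-federation's merged model acting as a constituent of its parent.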