We present Paris, the first publicly released diffusion model pre-trained entirely through decentralized computation. Paris demonstrates that high-quality text-to-image generation can be achieved without centrally coordinated infrastructure, and it is open for research and commercial use. Building Paris required implementing our Distributed Diffusion Training framework from scratch. The model consists of 8 expert diffusion models (129M-605M parameters each) trained in complete isolation, with no synchronization of gradients, parameters, or intermediate activations. Rather than requiring synchronized gradient updates across thousands of GPUs, we partition the data into semantically coherent clusters; each expert independently optimizes on its own cluster, and together the experts approximate the full distribution. A lightweight transformer router dynamically selects appropriate experts at inference, achieving generation quality comparable to centrally coordinated baselines. Eliminating synchronization enables training on heterogeneous hardware without specialized interconnects. Empirical validation confirms that Paris's decentralized training maintains generation quality while removing the need for a dedicated GPU cluster when training large-scale diffusion models. Paris achieves this using 14$\times$ less training data and 16$\times$ less compute than the prior decentralized baseline.
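The partition, isolated-training, and routing steps described in the abstract can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration and not the Paris implementation: the data are 2-D points, plain k-means stands in for semantic clustering, each "expert" is a linear least-squares denoiser rather than a diffusion model, and nearest-centroid lookup stands in for the transformer router. The helper names (`assign`, `kmeans`, `train_expert`, `denoise`) are hypothetical.

```python
import numpy as np

NUM_EXPERTS = 8  # matches the 8 experts in the abstract; the rest is a toy


def assign(x, centroids):
    """Index of the nearest centroid for each row of x (top-1 routing)."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means; a stand-in for the semantic clustering step."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centroids)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = x[labels == j].mean(axis=0)
    return centroids, assign(x, centroids)


def add_bias(x):
    """Append a constant-1 column so the linear map can learn an offset."""
    return np.hstack([x, np.ones((len(x), 1))])


def train_expert(clean, seed, noise_std=0.5):
    """Fit a toy linear denoiser on one cluster only.

    The function sees nothing but its own cluster, so no gradients,
    parameters, or activations are shared between experts.
    """
    dim = clean.shape[1]
    if len(clean) == 0:
        # Identity fallback for an empty cluster.
        return np.vstack([np.eye(dim), np.zeros((1, dim))])
    rng = np.random.default_rng(seed)
    noisy = clean + noise_std * rng.normal(size=clean.shape)
    weights, *_ = np.linalg.lstsq(add_bias(noisy), clean, rcond=None)
    return weights


def denoise(noisy, centroids, experts):
    """Route each input to one expert and apply only that expert."""
    expert_ids = assign(noisy, centroids)
    out = np.empty_like(noisy)
    for j, weights in enumerate(experts):
        mask = expert_ids == j
        out[mask] = add_bias(noisy[mask]) @ weights
    return out, expert_ids


# Toy data: 8 well-separated blobs on a circle.
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(NUM_EXPERTS) / NUM_EXPERTS
centers = 10.0 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
clean = np.concatenate([c + 0.3 * rng.normal(size=(200, 2)) for c in centers])

# 1) Partition the data. 2) Train each expert in isolation on its own cluster.
centroids, labels = kmeans(clean, NUM_EXPERTS)
experts = [train_expert(clean[labels == j], seed=j) for j in range(NUM_EXPERTS)]

# 3) At inference, the router picks one expert per input.
noisy = clean + 0.5 * rng.normal(size=clean.shape)
denoised, expert_ids = denoise(noisy, centroids, experts)
print("noisy MSE: ", np.mean((noisy - clean) ** 2))
print("routed MSE:", np.mean((denoised - clean) ** 2))
```

Because each call to `train_expert` depends only on its own cluster, the list comprehension could run on separate machines with no communication, which is the property the abstract emphasizes. The sketch says nothing about how Paris clusters data, trains its experts, or trains its router.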