End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer a means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose $\textit{DiffusionBlocks}$, a principled framework for transforming transformer-based networks into genuinely independently trainable blocks whose performance remains competitive with end-to-end training. Our key insight is that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates into those of a denoising process, in which each block can be learned independently via a score matching objective. This independence enables training with gradients for only one block at a time, reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures. Code is available at https://github.com/SakanaAI/DiffusionBlocks .
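The core idea, training each block independently against a denoising objective so that only one block's gradients are ever held in memory, can be illustrated with a minimal sketch. This is not the paper's implementation: the linear "blocks", the noise schedule `sigmas`, and the simple noise-prediction loss are illustrative assumptions standing in for the framework's actual per-block score matching objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_blocks, n_samples = 8, 4, 256

# Clean data, standing in for hidden representations.
data = rng.normal(size=(n_samples, dim))

# Hypothetical noise schedule: each block owns one noise level.
sigmas = np.linspace(1.0, 0.1, n_blocks)

# One linear "block" per noise level (a toy stand-in for a
# transformer block with a residual connection).
blocks = [rng.normal(scale=0.01, size=(dim, dim)) for _ in range(n_blocks)]

def train_block(W, sigma, lr=0.05, steps=200):
    """Denoising-style local objective for a single block:
    given x + sigma * eps, predict eps. A simplified stand-in for
    the per-block score matching loss."""
    for _ in range(steps):
        eps = rng.normal(size=data.shape)
        noisy = data + sigma * eps
        pred = noisy @ W                            # block's noise prediction
        grad = noisy.T @ (pred - eps) / n_samples   # dL/dW of 0.5 * MSE
        W -= lr * grad
    return W

# Blocks are trained one at a time: no activations or gradients
# from the other blocks are stored, so peak memory scales with a
# single block rather than the full depth.
for k in range(n_blocks):
    blocks[k] = train_block(blocks[k], sigmas[k])
```

The key property the sketch exhibits is that the loop body never touches more than one block, which is what lets memory requirements drop in proportion to the number of blocks.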