Decentralized autoregressive generation has attracted considerable attention in recent years as a way around scaling bottlenecks. Despite promising empirical results, however, the paradigm still lacks rigorous theoretical justification. In this work, we formally establish the theoretical equivalence between decentralized and centralized training. To do so, we adapt the Discrete Flow Matching framework to autoregressive generation and use its properties to show that a global model naturally decomposes into independent experts. Finally, we run extensive experiments across diverse multimodal benchmarks and find that decentralized training remains competitive with standard centralized architectures.