We present a theoretical analysis of the decentralization of autoregressive generation. We define the Decentralized Discrete Flow Matching objective by expressing the probability-generating velocity as a linear combination of expert flows. We also conduct experiments demonstrating the equivalence between decentralized and centralized training settings for multimodal language models across a diverse set of benchmarks. Specifically, we compare two distinct paradigms: LLaVA, which uses a fixed CLIP vision encoder, and InternVL 2.5-1B, which performs full-parameter fine-tuning (ViT+MLP+LLM) during the instruction tuning stage.
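As a minimal sketch of the form such a decomposition could take (the abstract does not give the exact definition; the number of experts $K$, the mixture weights $w_k$, and the expert velocities $u_t^{(k)}$ are assumptions introduced here for illustration), the probability-generating velocity may be written as a convex combination of expert flows,
$$
u_t(x) \;=\; \sum_{k=1}^{K} w_k\, u_t^{(k)}(x), \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1,
$$
so that each expert flow can be trained locally while the combined velocity matches the one a centralized model would generate.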