Strong semantic representations improve the convergence and generation quality of diffusion and flow models. Existing approaches largely rely on external models, which require separate training, operate on misaligned objectives, and exhibit unexpected scaling behavior. We argue that this dependence arises from the model's training objective, which poses a denoising task with little incentive to learn semantic representations. We introduce Self-Flow: a self-supervised flow matching paradigm that integrates representation learning within the generative framework. Our key mechanism, Dual-Timestep Scheduling, applies heterogeneous noise levels across tokens, creating an information asymmetry that forces the model to infer missing information from corrupted inputs. This drives the model to learn strong representations alongside generative capabilities without external supervision. Our method generalizes across modalities and enables multi-modal training while following expected scaling laws, achieving superior image, video, and audio generation.
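To make the mechanism concrete, the following is a minimal sketch of what per-token heterogeneous noise levels could look like in a linear flow-matching setup. The function name, the choice of two timesteps per sample, and the random token split are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def dual_timestep_corrupt(x, rng, split=0.5):
    """Sketch of Dual-Timestep Scheduling (assumed details): assign each
    token one of two per-sample timesteps, so tokens within a sequence
    carry different amounts of noise, creating information asymmetry.

    x: (batch, num_tokens, dim) array of clean tokens.
    Returns the corrupted tokens x_t, per-token timesteps t, and the
    flow-matching velocity target (noise - x) for the linear path.
    """
    b, n, _ = x.shape
    # Two independent timesteps per sample.
    t_a = rng.uniform(size=(b, 1, 1))
    t_b = rng.uniform(size=(b, 1, 1))
    # Randomly split tokens between the two noise levels.
    mask = rng.uniform(size=(b, n, 1)) < split
    t = np.where(mask, t_a, t_b)                 # (b, n, 1)
    noise = rng.standard_normal(x.shape)
    # Linear flow-matching path: x_t = (1 - t) * x + t * noise
    x_t = (1.0 - t) * x + t * noise
    target = noise - x                           # velocity target
    return x_t, t, target

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4))
x_t, t, target = dual_timestep_corrupt(x, rng)
```

Under this sketch, lightly-noised tokens provide context from which the model must reconstruct the velocity field for heavily-noised tokens, which is the asymmetry the abstract attributes to representation learning.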