We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
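To make the data-to-data idea concrete, the following is a minimal sketch of a textbook Brownian bridge between a source sample and a target sample, together with a plain velocity-matching loss. It is not the ViBT implementation: the function names, the `sigma` parameter, and the use of plain Python lists in place of image or video latents are assumptions made for illustration, and the variance-stabilized form of the objective is not reproduced here because the abstract does not specify it.

```python
import math
import random

def bridge_sample(x0, x1, t, sigma=1.0, rng=random):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps,  eps ~ N(0, 1).
    `sigma` is an illustrative noise scale, not a value from the paper.
    """
    std = sigma * math.sqrt(t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]

def velocity_target(x_t, x1, t):
    """Drift of the bridge toward its endpoint: (x1 - x_t) / (1 - t), for t < 1."""
    return [(b - c) / (1.0 - t) for b, c in zip(x1, x_t)]

def velocity_matching_loss(pred, target):
    """Plain mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(pred)

# Usage: one training-style step with a stand-in "model" that predicts x1 - x0.
x0 = [0.0, 2.0]          # source (e.g. input latent), hypothetical values
x1 = [1.0, 4.0]          # target (e.g. edited latent), hypothetical values
t = 0.5
x_t = bridge_sample(x0, x1, t, sigma=0.1)
target = velocity_target(x_t, x1, t)
pred = [b - a for a, b in zip(x0, x1)]
loss = velocity_matching_loss(pred, target)
```

Substituting the sampling formula into the target gives (x1 - x0) - sigma * sqrt(t / (1 - t)) * eps, so the noise term in this unnormalized target grows without bound as t approaches 1. That behavior is one plausible motivation for a variance-stabilized objective, although the exact stabilization used by ViBT is not stated in the abstract.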