We propose UniDFlow, a unified discrete flow-matching framework for multimodal understanding, generation, and editing. UniDFlow decouples understanding and generation through task-specific low-rank adapters, avoiding objective interference and representation entanglement, while a novel reference-based multimodal preference alignment scheme optimizes relative outcomes under identical conditioning, improving faithfulness and controllability without large-scale retraining. UniDFlow achieves state-of-the-art performance across eight benchmarks and exhibits strong zero-shot generalization to tasks including inpainting, in-context image generation, reference-based editing, and compositional generation, despite receiving no explicit task-specific training.
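The abstract gives no implementation details; as a minimal sketch of what "task-specific low-rank adapters" could look like, assuming a LoRA-style parameterization over a frozen shared projection (all names here, including TaskLoRALinear, rank, and the task keys, are illustrative rather than taken from the paper):

```python
import torch
import torch.nn as nn

class TaskLoRALinear(nn.Module):
    """A frozen base linear layer with one low-rank adapter per task.

    Understanding and generation each own a separate (down, up) pair,
    so gradients from one objective never flow into the other task's
    adapter. This is a rough illustration of the decoupling idea, not
    the paper's architecture.
    """

    def __init__(self, dim_in, dim_out, tasks=("understand", "generate"), rank=8):
        super().__init__()
        self.base = nn.Linear(dim_in, dim_out)
        for p in self.base.parameters():
            p.requires_grad_(False)  # shared backbone stays frozen
        self.down = nn.ModuleDict(
            {t: nn.Linear(dim_in, rank, bias=False) for t in tasks}
        )
        self.up = nn.ModuleDict(
            {t: nn.Linear(rank, dim_out, bias=False) for t in tasks}
        )
        for t in tasks:
            nn.init.zeros_(self.up[t].weight)  # adapters start as a no-op

    def forward(self, x, task):
        # Base output plus the low-rank correction for the selected task.
        return self.base(x) + self.up[task](self.down[task](x))

# The same layer serves both tasks; only the chosen adapter is active.
layer = TaskLoRALinear(512, 512)
x = torch.randn(4, 512)
h_und = layer(x, task="understand")
h_gen = layer(x, task="generate")
```

Routing by task key keeps the two objectives parameter-disjoint outside the frozen backbone, which is one plausible way to avoid the objective interference the abstract mentions.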
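The phrase "optimizes relative outcomes under identical conditioning" suggests a pairwise preference objective. Below is a hedged sketch in the style of DPO, comparing a preferred and a rejected output generated from the same condition against a frozen reference model; the function name and the beta value are assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def reference_preference_loss(logp_win, logp_lose,
                              ref_logp_win, ref_logp_lose,
                              beta=0.1):
    """DPO-style pairwise loss for two outputs sampled under the SAME
    conditioning: increase the policy's margin on the preferred output
    relative to a frozen reference model. One plausible instantiation
    of 'reference-based preference alignment'; not the paper's loss.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

# Example with dummy sequence log-probabilities for one preference pair.
lw, ll = torch.tensor([-4.0]), torch.tensor([-6.0])  # policy: win / lose
rw, rl = torch.tensor([-5.0]), torch.tensor([-5.5])  # reference: win / lose
loss = reference_preference_loss(lw, ll, rw, rl)
```

Because both outputs share one condition, the loss only rewards relative improvement, which matches the abstract's claim of better faithfulness and controllability without large-scale retraining.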