VIBE: Visual Instruction Based Editor

Grigorii Alekseenko, Aleksandr Gordeev, Irina Tolstykh, Bulat Suleimanov, Vladimir Dokholyan, Georgii Fedorov, Sergey Yakubson, Aleksandra Tsybina, Mikhail Chernyshov, Maksim Kuprashevich

Instruction-based image editing is among the fastest-developing areas in generative AI. Over the past year, the field has reached a new level, with dozens of open-source models released alongside highly capable commercial systems. However, only a limited number of open-source approaches currently achieve real-world quality. In addition, diffusion backbones, the dominant choice for these pipelines, are often too large and computationally expensive for many deployments and research settings, with widely used variants typically containing 6B to 20B parameters. This paper presents a compact, high-throughput instruction-based image editing pipeline that uses a modern 2B-parameter Qwen3-VL model to guide the editing process and the 1.6B-parameter diffusion model Sana1.5 for image generation. Our design decisions across architecture, data processing, training configuration, and evaluation target low-cost inference and strict source consistency while maintaining high quality across the major edit categories feasible at this scale. Evaluated on the ImgEdit and GEdit benchmarks, the proposed method matches or exceeds the performance of substantially heavier baselines, including models with several times as many parameters and higher inference cost, and is particularly strong on edits that require preserving the input image, such as attribute adjustment, object removal, background edits, and targeted replacement. The model fits within 24 GB of GPU memory and generates edited images at up to 2K resolution in approximately 4 seconds on an NVIDIA H100 in BF16, without additional inference optimizations or distillation.
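As a rough sanity check on the memory figure, the weight footprint implied by the stated parameter counts can be estimated with simple arithmetic. The sketch below is a back-of-envelope calculation, not a measurement: it assumes 2 bytes per parameter (BF16) and counts only the 2B-parameter Qwen3-VL model and the 1.6B-parameter Sana1.5 diffusion model. It deliberately ignores activations, any autoencoder or auxiliary modules, and framework overhead, all of which add to real usage. The function and constant names are illustrative.

```python
# Back-of-envelope estimate of BF16 weight memory (illustrative only).
# Assumption: 2 bytes per parameter; activations and other modules are ignored.

BYTES_PER_BF16_PARAM = 2

VLM_PARAMS = 2.0e9        # Qwen3-VL, as stated in the abstract
DIFFUSION_PARAMS = 1.6e9  # Sana1.5, as stated in the abstract
GPU_BUDGET_GB = 24.0      # memory budget stated in the abstract


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Return the weight footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


total_weights_gb = weight_memory_gb(VLM_PARAMS + DIFFUSION_PARAMS)
headroom_gb = GPU_BUDGET_GB - total_weights_gb

print(f"weights: {total_weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
```

Under these assumptions the weights alone occupy roughly 7.2 GB, which leaves the remainder of the 24 GB budget for activations at high resolution and for components not counted here.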
