Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recent diffusion-based retouching methods achieve superior visual quality, but they often suffer from reduced fidelity because of their generative nature and from low efficiency because of their iterative sampling process. In this work, we propose an efficient, fidelity-preserving retouching method based on manipulation in bilateral space, a representation that is both compact and decoupled from image content. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which is sliced using a learned guidance map and then applied to the full-resolution image. This design yields both high fidelity and improved efficiency. To retain the strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss that encourages instruction following. Additionally, we introduce a new benchmark and evaluate our method along three dimensions: fidelity, instruction following, and efficiency. Compared with recent retouching methods such as Gemini-2.5-Flash (Nano-Banana), our method avoids content drift, substantially reduces latency, and produces visually pleasing edits while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.
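To make the slice-and-apply step concrete, the following is a minimal NumPy sketch of generic bilateral grid slicing, not the implementation used in this work. It assumes a grid of shape `(Gh, Gw, Gd, 3, 4)` holding one 3x4 affine color transform per cell, a guidance map with values in `[0, 1]`, and trilinear interpolation; the function names `slice_bilateral_grid` and `apply_affine` are hypothetical, and the mean-of-channels guidance in the usage lines is only a stand-in for the learned guidance map.

```python
import numpy as np


def _corners(coord, size):
    """Return lower index, upper index, and fractional offset along one grid axis."""
    lo = np.clip(np.floor(coord).astype(int), 0, max(size - 2, 0))
    hi = np.minimum(lo + 1, size - 1)
    return lo, hi, coord - lo


def slice_bilateral_grid(grid, guide):
    """Trilinearly sample per-pixel affine coefficients from a low-resolution grid.

    grid:  (Gh, Gw, Gd, 3, 4) array of affine transforms (assumed layout).
    guide: (H, W) guidance map in [0, 1]; selects the position along the depth axis.
    Returns an (H, W, 3, 4) array of per-pixel affine coefficients.
    """
    gh, gw, gd = grid.shape[:3]
    h, w = guide.shape
    # Continuous grid coordinates of every full-resolution pixel.
    ys = np.broadcast_to(np.linspace(0, gh - 1, h)[:, None], (h, w))
    xs = np.broadcast_to(np.linspace(0, gw - 1, w)[None, :], (h, w))
    zs = np.clip(guide, 0.0, 1.0) * (gd - 1)

    y0, y1, fy = _corners(ys, gh)
    x0, x1, fx = _corners(xs, gw)
    z0, z1, fz = _corners(zs, gd)

    # Accumulate the eight weighted corner contributions.
    out = np.zeros((h, w) + grid.shape[3:])
    for yi, wy in ((y0, 1 - fy), (y1, fy)):
        for xi, wx in ((x0, 1 - fx), (x1, fx)):
            for zi, wz in ((z0, 1 - fz), (z1, fz)):
                out += (wy * wx * wz)[..., None, None] * grid[yi, xi, zi]
    return out


def apply_affine(coeffs, image):
    """Apply per-pixel 3x4 affine transforms to an (H, W, 3) image."""
    ones = np.ones(image.shape[:2] + (1,))
    homogeneous = np.concatenate([image, ones], axis=-1)  # (H, W, 4)
    return np.einsum("hwij,hwj->hwi", coeffs, homogeneous)


# Usage: an identity grid leaves the image unchanged.
image = np.random.default_rng(0).random((64, 96, 3))
guide = image.mean(axis=-1)  # stand-in for a learned guidance map
grid = np.zeros((8, 8, 4, 3, 4))
grid[..., :3, :3] = np.eye(3)
retouched = apply_affine(slice_bilateral_grid(grid, guide), image)
```

Because each output pixel is an affine function of the corresponding input pixel, this formulation can change color and tone but cannot move edges or synthesize new texture, which is consistent with the fidelity argument above. The grid is also far smaller than the image, so the cost of predicting it does not grow with the output resolution.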