Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result still depends on the randomly sampled initial noise, particularly in complex editing scenarios, and an unsuitable initial noise can lead to an unsatisfactory edit. Recent inference-time scaling methods address this issue by sampling multiple initial noise candidates and selecting the better ones. Most of them, however, follow a decode-then-verify scheme, which introduces an efficiency-accuracy trade-off: images decoded after only a few inference steps are often too noisy to assess reliably, whereas sufficiently denoised images cost far more computation to produce. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise at an early stage using a latent-space editing activation map. The verifier identifies promising candidates by assessing whether they induce an effective edit in the correct region, which enables efficient early pruning without decoding latents into images. Building on this verifier, we further develop an adaptive search strategy for inference-time scaling that allocates the inference budget according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.
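
The verify-then-prune idea described above can be illustrated with a minimal Python sketch. The abstract does not specify how the editing activation map is computed or scored, so everything here is an assumption made for illustration: `activation_score` (the fraction of activation magnitude that falls inside the target region), the callback `early_activation`, the batch-wise stopping rule, and all parameter names and default values are hypothetical, not the actual VeriLatent design.

```python
from typing import Callable, Sequence, Tuple


def activation_score(activation: Sequence[float], target_mask: Sequence[int]) -> float:
    """Fraction of total activation magnitude inside the target region.

    Assumption: `activation` is a flattened latent-space change map and
    `target_mask` marks the region that the instruction should edit.
    """
    total = sum(abs(a) for a in activation)
    if total == 0.0:
        return 0.0  # the candidate induces no edit at all
    inside = sum(abs(a) for a, m in zip(activation, target_mask) if m)
    return inside / total


def adaptive_search(
    early_activation: Callable[[int], Sequence[float]],
    target_mask: Sequence[int],
    early_steps: int,
    total_steps: int,
    batch_size: int = 2,
    max_candidates: int = 8,
    accept_threshold: float = 0.8,
) -> Tuple[int, float, int]:
    """Return (best_seed, best_score, nfe) for a toy adaptive search.

    `early_activation(seed)` is a hypothetical callback that runs
    `early_steps` denoising steps from the noise with that seed and
    returns its activation map, without decoding to an image.
    """
    best_seed, best_score, nfe = -1, -1.0, 0
    start = 0
    while start < max_candidates:
        stop = min(start + batch_size, max_candidates)
        for seed in range(start, stop):
            score = activation_score(early_activation(seed), target_mask)
            nfe += early_steps  # each candidate costs only the early steps
            if score > best_score:
                best_seed, best_score = seed, score
        start = stop
        if best_score >= accept_threshold:
            break  # easy edit: stop sampling new candidates
    # Only the surviving candidate is denoised to completion.
    nfe += total_steps - early_steps
    return best_seed, best_score, nfe
```

In this sketch an easy edit stops after the first batch, while a harder one keeps drawing candidates up to `max_candidates`, so the total NFE grows with editing difficulty rather than being fixed in advance.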
