Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity, built from two components, BitEdit and ResEdit. BitEdit performs source-negative guidance: it tilts the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together, the two components couple decision-time bit guidance with combination-time code composition, so that latent features outside the mask are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving the edited-region CLIP score by 1.07 over the strongest prior editor while keeping background preservation competitive with that editor. Ablations show that BitEdit and ResEdit play complementary roles: BitEdit in target alignment and ResEdit in background preservation.
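To make the two mechanisms concrete, the following NumPy sketch illustrates one possible reading of the description above. It is not the authors' implementation. The function names, the guidance weight `lam`, the trust-region radius `delta`, the ±1/√d bit-to-code mapping, and the assumption that all per-scale residuals already live on a common grid are illustrative choices, not details given in the abstract. The abstract states that the trust-region projection is closed-form; since that formula is not given here, the sketch substitutes a per-bit bisection on the step size.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_kl(q, p, eps=1e-12):
    """Elementwise KL(Bern(q) || Bern(p))."""
    q = np.clip(q, eps, 1.0 - eps)
    p = np.clip(p, eps, 1.0 - eps)
    return q * np.log(q / p) + (1.0 - q) * np.log((1.0 - q) / (1.0 - p))

def bit_edit(logit_cfg, logit_tgt, logit_src, lam=2.0, delta=0.05, iters=40):
    """Tilt post-CFG log-odds along the target-minus-source contrast, then
    shrink each bit's step until KL to the clean CFG Bernoulli is <= delta.
    Bisection stands in for the closed-form projection (an assumption)."""
    direction = logit_tgt - logit_src
    p_cfg = sigmoid(logit_cfg)
    full_ok = bernoulli_kl(sigmoid(logit_cfg + lam * direction), p_cfg) <= delta
    lo = np.zeros_like(logit_cfg)   # step scale 0 always satisfies the bound
    hi = np.ones_like(logit_cfg)    # step scale 1 is the unconstrained tilt
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        ok = bernoulli_kl(sigmoid(logit_cfg + mid * lam * direction), p_cfg) <= delta
        lo = np.where(ok, mid, lo)
        hi = np.where(ok, hi, mid)
    scale = np.where(full_ok, 1.0, lo)
    return logit_cfg + scale * lam * direction

def bits_to_code(bits):
    """Map {0,1} bits to a continuous code of +-1/sqrt(d) (assumed mapping)."""
    d = bits.shape[-1]
    return (2.0 * bits - 1.0) / np.sqrt(d)

def res_edit(src_residuals, edit_residuals, masks):
    """Mask-gated sum of per-scale residuals. Where the mask is 0, only the
    source residuals contribute, so the source code is reproduced exactly."""
    code = np.zeros_like(src_residuals[0], dtype=float)
    for r_src, r_edit, m in zip(src_residuals, edit_residuals, masks):
        code = code + (1.0 - m) * r_src + m * r_edit
    return code

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 4, 8)                      # (height, width, bits per position)
    mask = np.zeros((4, 4, 1))
    mask[1:3, 1:3] = 1.0                   # hypothetical edit region
    src, edit = [], []
    for _ in range(3):                     # three scales on a common grid
        l_cfg, l_tgt, l_src = (rng.normal(size=shape) for _ in range(3))
        logits = bit_edit(l_cfg, l_tgt, l_src)
        bits = (rng.random(shape) < sigmoid(logits)).astype(float)
        edit.append(bits_to_code(bits))
        src.append(rng.normal(size=shape) / np.sqrt(shape[-1]))
    code = res_edit(src, edit, [mask] * 3)
    print(code.shape)
```

Because the gate multiplies the edit residual by exactly zero outside the mask, background preservation in this sketch is a property of the arithmetic rather than of the sampler, which mirrors the "preserved exactly by code arithmetic" claim. The trust-region step only limits how far each bit's distribution can move from the clean CFG prediction.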