Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source-target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.
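The two mechanisms described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the direction of the KL divergence (edited versus clean), the per-bit bisection used in place of the closed-form projection the abstract mentions, the mapping from bits to codes (±1/√d), and the simplification that all per-scale residuals have already been upsampled to a common resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_kl(p, q, eps=1e-12):
    """Elementwise KL(Bern(p) || Bern(q))."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def guided_logits(l_cfg, l_src, l_tgt, scale, kl_budget, iters=30):
    """Tilt post-CFG log-odds along (target - source), then shrink each
    bit's step until KL(edited || clean) <= kl_budget.
    Assumption: bisection stands in for the paper's closed-form projection."""
    direction = l_tgt - l_src          # source-negative contrast
    p_clean = sigmoid(l_cfg)
    lo = np.zeros_like(l_cfg)          # step 0 is always inside the region
    hi = np.full_like(l_cfg, scale)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        ok = bernoulli_kl(sigmoid(l_cfg + mid * direction), p_clean) <= kl_budget
        lo = np.where(ok, mid, lo)
        hi = np.where(ok, hi, mid)
    # Keep the full step wherever it already satisfies the budget.
    full_ok = bernoulli_kl(sigmoid(l_cfg + scale * direction), p_clean) <= kl_budget
    step = np.where(full_ok, scale, lo)
    return l_cfg + step * direction

def bits_to_code(bits):
    """Map {0,1} bits to a continuous code (assumed ±1/sqrt(d) convention)."""
    d = bits.shape[-1]
    return (2.0 * bits - 1.0) / np.sqrt(d)

def compose(src_residuals, edit_residuals, masks):
    """Mask-gated sum over scales: outside the mask the source residual is
    passed through unchanged; inside it the edited residual replaces it.
    Assumes every residual is already at the final (H, W, d) resolution."""
    out = np.zeros_like(src_residuals[0])
    for r_src, r_edit, m in zip(src_residuals, edit_residuals, masks):
        out = out + (r_src + m[..., None] * (r_edit - r_src))
    return out
```

Because the gate multiplies the difference between edited and source residuals, a mask value of 0 leaves the source term untouched at that scale, which is one way to read the abstract's claim that masked-out features are preserved by code arithmetic.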

