Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone. (1) Complexity-Adaptive Reasoning Depth (CARD) predicts edit complexity directly from the instruction and reference image and allocates the reasoning step count N_r and reasoning-token length r per sample -- turning a previously fixed inference schedule into a conditional-computation problem. (2) A Spatial Reasoning Mask (SRM) supports CARD by extracting an instruction-conditioned spatial prior from cross-attention, confining reasoning to the regions that semantically require it. On the full 737-case ImgEdit Basic-Edit Suite, PhysEdit delivers a 1.18x wall-clock speedup (64.3s vs. 76.1s per sample) over a strong reasoning baseline while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent, reaching 1.52x on appearance-level edits, which validates CARD's adaptive allocation as the principal source of efficiency gain. A 30-sample pilot with full ablations isolates the contribution of each module.
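To make the two-module composition concrete, the following is a minimal sketch, not the authors' implementation: it assumes hypothetical interfaces (`complexity_head`, `cross_attention_maps`, `reasoning_step`) and an illustrative three-tier budget table, and shows only how a CARD-style per-sample allocation of (N_r, r) and an SRM-style cross-attention mask could compose at inference time.

```python
# Illustrative sketch only; function names, signatures, and thresholds are assumptions.
import torch


def allocate_budget(complexity: float) -> tuple[int, int]:
    """Map a predicted complexity score in [0, 1] to (N_r reasoning steps, r reasoning tokens)."""
    if complexity < 0.33:    # appearance-level edits: shallow reasoning suffices
        return 1, 64
    if complexity < 0.66:    # object-level edits: moderate budget
        return 2, 128
    return 4, 256            # physical-action edits: full budget


def spatial_reasoning_mask(attn: torch.Tensor, quantile: float = 0.8) -> torch.Tensor:
    """Binarize an instruction-conditioned cross-attention map (H, W) into a spatial prior."""
    attn = attn / attn.max().clamp(min=1e-8)
    threshold = attn.flatten().quantile(quantile)
    return (attn >= threshold).float()


def adaptive_edit(image, instruction, complexity_head, cross_attention_maps, reasoning_step):
    """Compose CARD-style budget allocation with an SRM-style mask at inference time."""
    complexity = complexity_head(image, instruction)           # per-sample complexity estimate
    n_steps, n_tokens = allocate_budget(complexity)            # conditional computation budget
    mask = spatial_reasoning_mask(cross_attention_maps(image, instruction))

    state = image
    for _ in range(n_steps):                                   # adaptive number of reasoning steps
        state = reasoning_step(state, instruction, mask, max_tokens=n_tokens)
    return state
```

Under this reading, the fixed-recipe baseline corresponds to always returning the largest budget from `allocate_budget` and an all-ones mask; the reported speedups come from samples that exit with the smaller tiers.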