Text-Vision Co-Instructed Image Editing

Existing image editing methods fall broadly into two categories: those driven by textual instructions and those driven by visual prompts. Textual instructions are semantically expressive but offer only coarse spatial control over the edit. Visual prompts such as drags and points, in contrast, provide precise spatial guidance but leave the semantic intent ambiguous. To combine the strengths of both, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance to achieve precise, intent-faithful image manipulation. To this end, we first construct a paired textual-visual instruction dataset of more than 23K samples derived from dynamic videos, which provides aligned supervision for cross-modal instructions. We then propose TV-Edit, a unified Textual-Visual instruction Editing framework that contextualizes drag- or point-based visual instructions with image-text semantics and lifts them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent with spatial constraints, TV-Edit achieves more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a carefully designed benchmark that evaluates semantic faithfulness, spatial alignment, and visual consistency, using ground-truth references and controlled textual-visual variations for reliable assessment. Experiments across multiple editing backbones show that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
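To make the co-instruction input concrete, the following is a minimal sketch of how a text instruction might be paired with sparse drag or point instructions. The names `Drag`, `CoInstruction`, and `mean_target_error` are hypothetical and do not come from TV-Edit or TV-Edit-Bench. The distance measure is only one plausible reading of "spatial alignment", not the benchmark's actual metric.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class Drag:
    """Hypothetical sparse visual instruction: move content from `start` to `end`."""
    start: Point
    end: Point


@dataclass
class CoInstruction:
    """Hypothetical container pairing semantic intent (text) with spatial guidance."""
    text: str
    drags: List[Drag] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


def mean_target_error(instruction: CoInstruction, landed: List[Point]) -> float:
    """Mean Euclidean distance between each drag target and where the content landed.

    Assumed illustrative measure of spatial alignment; lower is better.
    """
    if len(landed) != len(instruction.drags):
        raise ValueError("need exactly one landed point per drag")
    if not landed:
        return 0.0
    distances = [
        hypot(drag.end[0] - p[0], drag.end[1] - p[1])
        for drag, p in zip(instruction.drags, landed)
    ]
    return sum(distances) / len(distances)


# The text carries the intent; the drag says where the edited content should go.
sample = CoInstruction(
    text="raise the dog's left paw",
    drags=[Drag(start=(120.0, 200.0), end=(120.0, 150.0))],
)
error = mean_target_error(sample, landed=[(123.0, 154.0)])
```

In this sketch the text disambiguates what the drag means (raising a paw rather than moving the whole dog), while the drag endpoint pins down where the result should appear, which mirrors the division of roles described in the abstract.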
