We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.
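The abstract gives no procedure for ACE-Seed, so the following is only a minimal sketch of one plausible reading of "attention consensus": each candidate seed is scored by how closely its early-step self-attention map agrees with the mean map over all candidates, and the best-agreeing seed is kept. The function names, the flattened-list representation of a map, and the cosine-to-mean scoring rule are all assumptions made for illustration, not the authors' method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def select_seed_by_consensus(attn_by_seed):
    """Pick the seed whose early-step map best matches the mean map.

    attn_by_seed: dict mapping seed -> flattened attention map (hypothetical
    input; a real pipeline would extract these from the first denoising steps).
    Returns (best_seed, scores). The scoring rule is an assumption.
    """
    if not attn_by_seed:
        raise ValueError("need at least one candidate seed")
    seeds = list(attn_by_seed)
    size = len(attn_by_seed[seeds[0]])
    # Element-wise mean over all candidate maps acts as the "consensus" map.
    mean_map = [
        sum(attn_by_seed[s][i] for s in seeds) / len(seeds) for i in range(size)
    ]
    scores = {s: cosine(attn_by_seed[s], mean_map) for s in seeds}
    best = max(seeds, key=lambda s: scores[s])
    return best, scores


# Toy maps: seeds 0 and 1 attend to the same region, seed 2 is an outlier.
toy_maps = {
    0: [0.9, 0.1, 0.0, 0.0],
    1: [0.8, 0.2, 0.0, 0.0],
    2: [0.0, 0.0, 0.1, 0.9],
}
best_seed, seed_scores = select_seed_by_consensus(toy_maps)
print(best_seed, seed_scores)
```

Because the score uses only maps from the first few denoising steps, a search of this kind would need neither full look-ahead sampling nor an external evaluator, which is the property the abstract attributes to ACE-Seed.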