This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing demand for accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses both issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. It supports both positive and negative points, which indicate regions suitable or unsuitable for insertion and thereby provide fine-grained spatial control over object placement. Point2Insert is trained in two stages. In Stage 1, we train an insertion model that generates objects in specified regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we use a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with 10$\times$ more parameters.