Controllable video character replacement with a user-provided identity remains a challenging problem due to the lack of paired video data. Prior works have predominantly relied on a reconstruction-based paradigm that requires per-frame segmentation masks and explicit structural guidance (e.g., skeleton, depth). This reliance, however, severely limits their generalizability in complex scenarios involving occlusions, character-object interactions, unusual poses, or challenging illumination, often leading to visual artifacts and temporal inconsistencies. In this paper, we propose MoCha, a pioneering framework that bypasses these limitations by requiring only a segmentation mask on a single, arbitrary frame. To effectively adapt to the multi-modal input conditions and enhance facial identity preservation, we introduce a condition-aware RoPE and employ an RL-based post-training stage. Furthermore, to overcome the scarcity of high-quality paired training data, we propose a comprehensive data construction pipeline. Specifically, we design three specialized datasets: a high-fidelity rendered dataset built with Unreal Engine 5 (UE5), an expression-driven dataset synthesized with current portrait animation techniques, and an augmented dataset derived from existing video-mask pairs. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research. Please refer to our project page for more details: orange-3dv-team.github.io/MoCha