ARGUS: Stacked Multi-View Identity Mosaic Injection for Subject-Preserving Video Generation

Subject-preserving video generation is not solved by frontal-face similarity alone: a generated person must remain recognizable across motion, large viewpoint changes, expression shifts, occlusion, scale variation, and conflicts among text, first-frame, and identity references. We argue that the central bottleneck is the point-reference paradigm, which collapses identity into a single static observation entangled with pose, accessories, lighting, background, and camera statistics. We introduce Argus, a Wan-based framework centered on Stacked Multi-View Identity Mosaic Injection (SMII). SMII converts MLLM-selected image and video identity evidence into a 3×3 stacked mosaic, synchronizes the mosaic with the current diffusion time, and injects it as negative-time, read-only memory in Wan's native token space. This turns identity from an external clean adapter or a single reference image into a compact dynamic distribution. Around SMII, an MLLM Identity Director selects informative identity moments and resolves condition conflicts, while no-cross-pair counterfactual training, Temporal Identity Annealing, and Adaptive Self-Likeness Guidance improve robustness without paired subject-video supervision. We further release HardID-Celeb, a public-figure identity-stress benchmark, and introduce YawScore and OccScore to probe robustness to large yaw and first-frame occlusion. Argus achieves state-of-the-art results on OpenS2V-Eval Human-Domain, reaching a Total Score of 64.38, a FaceSim of 71.86, a NexusScore of 51.62, and a NaturalScore of 79.14. On HardID-Celeb, Argus obtains a FaceSim of 76.80 and improves YawScore and OccScore by 12.60 and 15.10 points, respectively, over the strongest baselines, demonstrating that dynamic identity memory and large-scale counterfactual self-supervision are highly effective for subject-preserving video generation.
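
To make the three SMII steps named in the abstract concrete, the following is a minimal sketch and not the Argus implementation. It rests on several assumptions that the abstract does not confirm: that "stacked mosaic" means spatially tiling nine equally sized identity crops into a 3×3 grid, that "synchronizing with the current diffusion time" means noising the mosaic to the same level as the video latents with a linear interpolation toward Gaussian noise, and that "negative-time" means giving the memory a temporal index of -1, ahead of the video frames. The function names `build_mosaic`, `noise_to_time`, and `temporal_positions` are hypothetical.

```python
import numpy as np


def build_mosaic(crops, grid=3):
    """Tile grid*grid equally sized H x W x C crops into one (grid*H) x (grid*W) x C array.

    Assumption: the mosaic is a row-major spatial grid; the paper may stack differently.
    """
    if len(crops) != grid * grid:
        raise ValueError(f"expected {grid * grid} crops, got {len(crops)}")
    rows = [np.concatenate(crops[r * grid:(r + 1) * grid], axis=1) for r in range(grid)]
    return np.concatenate(rows, axis=0)


def noise_to_time(x, t, rng):
    """Noise x to diffusion time t in [0, 1]: t=0 returns x, t=1 returns pure noise.

    Assumption: a linear (flow-matching style) interpolation; the actual schedule may differ.
    """
    noise = rng.standard_normal(x.shape)
    return (1.0 - t) * x + t * noise


def temporal_positions(num_video_frames):
    """Temporal indices for [memory, frame_0, ..., frame_{T-1}].

    Assumption: the identity memory sits at index -1, before the first video frame.
    """
    return np.concatenate(([-1], np.arange(num_video_frames)))


# Usage: nine dummy 2x2 RGB crops, each filled with its own index.
crops = [np.full((2, 2, 3), i, dtype=np.float32) for i in range(9)]
mosaic = build_mosaic(crops)
rng = np.random.default_rng(0)
noisy_mosaic = noise_to_time(mosaic, t=0.5, rng=rng)
positions = temporal_positions(num_video_frames=4)
```

In this sketch the mosaic is noised with the same `t` as the video latents, so the memory and the frames being denoised share a noise level at every step. How the memory is kept read-only (for example, by excluding its tokens from the denoising update) is left out because the abstract does not describe the mechanism.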
