Avatar V: Scaling Video-Reference Avatar Video Generation

Benjamin Liang, Ce Chen, Desmond Lin, Ivan Somov, Jiajun Zhao, Jiewei Yuan, Jingfeng Zhang, Junhao Huang, Nik Nolte, Pedram Haqiqi, Penghan Wang, Rong Yan, Rui Zhang, Sam Prokopchuk, Sivan Wang, Viktor Goriachko, Yi Ren, Yuanming Li, Yutao Chen, Zhenhui Ye, Zhibin Hong, Zilong Nie, Zujin Guo

From arXiv; 31 pages, 15 figures. All contributors are listed in alphabetical order by first name.

Generating avatar videos that are not merely visually similar to a target individual but behaviorally recognizable, faithfully reproducing their talking rhythm, gestural tendencies, and expression dynamics, remains an open challenge. Existing methods predominantly condition on single static images, which provide insufficient identity information and cannot capture dynamic motion traits, while standard pixel-level objectives underserve the perceptually critical facial regions that determine avatar fidelity. We present Avatar V, a production-scale framework that addresses these limitations through video-reference-conditioned identity modeling. Rather than compressing identity into fixed-size embeddings, the model conditions directly on the full token sequence of a reference video, learning to reproduce both static identity attributes (facial geometry, skin texture) and dynamic behavioral patterns (talking rhythm, micro-expressions) through attention over the reference context. We introduce Sparse Reference Attention, an asymmetric mechanism achieving linear-complexity conditioning on arbitrarily long references; a motion representation stream enabling closed-loop talking style transfer; and an identity-aware super-resolution refiner inheriting the full reference conditioning. These are supported by a data engine curating 100M+ training clips from 50M raw videos, and a five-stage training pipeline with flow matching pre-training, personality fine-tuning, two-phase distillation (>10x acceleration), and RLHF alignment, deployed across thousands of GPUs. Avatar V generates 1080p videos of unlimited duration, achieving state-of-the-art identity preservation, lip synchronization, and generation quality on our cross-scene benchmark, consistently outperforming leading systems including Seedance 2.0, Kling O3 Pro, Veo 3.1, and OmniHuman 1.5 in both automated metrics and human evaluation.
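
The abstract does not spell out how Sparse Reference Attention works, so the following is only a minimal sketch of one asymmetric conditioning pattern that is consistent with its description: queries come from the generated tokens alone, while keys and values cover both the reference and the generated tokens. Reference tokens never attend to generated ones, so for a fixed generation window the number of attention scores grows linearly with reference length. The function names, the single-head layout, the projection matrices, and the absence of any sparsity pattern are all assumptions made for illustration, not the authors' design.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def asymmetric_reference_attention(gen, ref, wq, wk, wv):
    """Single-head attention with queries from generated tokens only.

    gen: (n_gen, d) generated-video tokens (hypothetical layout)
    ref: (n_ref, d) reference-video tokens (hypothetical layout)
    wq, wk, wv: (d, d) projection matrices (hypothetical)
    Returns an (n_gen, d) array; reference tokens are not updated.
    """
    d = wq.shape[1]
    q = gen @ wq                               # queries: generated tokens only
    ctx = np.concatenate([ref, gen], axis=0)   # keys/values: reference + generated
    k = ctx @ wk
    v = ctx @ wv
    scores = q @ k.T / np.sqrt(d)              # shape (n_gen, n_ref + n_gen)
    return softmax(scores, axis=-1) @ v


def score_count(n_gen, n_ref):
    # Number of attention scores computed by the function above.
    return n_gen * (n_ref + n_gen)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gen = rng.normal(size=(4, 8))
    ref = rng.normal(size=(100, 8))
    wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
    out = asymmetric_reference_attention(gen, ref, wq, wk, wv)
    print(out.shape, score_count(4, 100), score_count(4, 200))
```

By contrast, full self-attention over the concatenated sequence would compute (n_ref + n_gen)^2 scores, which is quadratic in reference length. Because the reference keys and values here do not depend on the generated tokens, they could in principle be computed once and reused across generation windows; whether Avatar V does so is not stated in the abstract.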
