We present SentiAvatar, a framework for building expressive, interactive 3D digital humans, and use it to create SuSu, a virtual character that speaks, gestures, and emotes in real time. Building such a system remains challenging because it requires jointly addressing three problems: the lack of large-scale, high-quality multimodal data, the difficulty of robust semantic-to-motion mapping, and the need for fine-grained, frame-level motion-prosody synchronization. To address them, we first build SuSuInterActs (21K clips, 37 hours), a dialogue corpus centered on a single character and recorded with optical motion capture, with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a Motion Foundation Model on 200K+ motion sequences, equipping it with rich action priors that extend well beyond conversational motion. Third, we propose an audio-aware plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech. Experiments show that SentiAvatar achieves state-of-the-art results on both SuSuInterActs (R@1 43.64%, nearly twice the best baseline) and BEATv2 (FGD 4.941, BC 8.078), while producing 6 s of output in 0.3 s and supporting unlimited multi-turn streaming. The source code, model, and dataset are available at https://sentiavatar.github.io.
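To put the quoted figures in perspective, the short sketch below derives two quantities implied by the abstract: the average clip length in SuSuInterActs and the ratio of generated motion duration to generation time. The function names are hypothetical and chosen only for illustration. The inputs are the rounded values stated above (21K clips, 37 hours, 6 s in 0.3 s), so the results are approximate.

```python
# Back-of-the-envelope arithmetic on figures quoted in the abstract.
# All inputs are rounded values from the text; helper names are illustrative only.

CLIPS = 21_000          # "21K clips"
HOURS = 37.0            # "37 hours"
OUTPUT_SECONDS = 6.0    # "6 s of output"
LATENCY_SECONDS = 0.3   # "in 0.3 s"


def mean_clip_seconds(total_hours: float, clip_count: int) -> float:
    """Average clip duration in seconds, given total hours and clip count."""
    return total_hours * 3600.0 / clip_count


def real_time_factor(output_seconds: float, latency_seconds: float) -> float:
    """Seconds of motion produced per second of compute."""
    return output_seconds / latency_seconds


if __name__ == "__main__":
    print(f"mean clip length: {mean_clip_seconds(HOURS, CLIPS):.2f} s")
    print(f"real-time factor: {real_time_factor(OUTPUT_SECONDS, LATENCY_SECONDS):.1f}x")
```

Under these rounded inputs, the average clip is a little over 6 s long, which is close to the 6 s generation window quoted for inference, and generation runs at roughly 20 times real time.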