In this study, we aim to create interactive avatar agents that autonomously plan and animate nuanced facial movements that are realistic from both visual and behavioral perspectives. Given high-level inputs about the environment and the agent profile, our framework harnesses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. Our task-agnostic driving engine converts these descriptions into motion token sequences, which are then converted into continuous motion embeddings and consumed by our standalone neural renderer to generate the final photorealistic avatar animations. This streamlined pipeline allows our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents (one capable of monadic interaction with the environment, the other designed for dyadic conversation), validates the effectiveness and versatility of our approach. To our knowledge, this work is a step forward in combining LLMs and neural rendering for generalized non-verbal prediction and photorealistic rendering of avatar agents.