In this study, we aim to create interactive avatar agents that autonomously plan and animate nuanced facial movements that are realistic from both visual and behavioral perspectives. Given high-level inputs describing the environment and the agent profile, our framework uses LLMs to produce a series of detailed text descriptions of the avatar agents' facial motions. Our task-agnostic driving engine converts these descriptions into motion token sequences, which are then converted into continuous motion embeddings and consumed by our standalone neural renderer to generate the final photorealistic avatar animations. This streamlined pipeline allows our framework to adapt to a variety of non-verbal avatar interactions, both monadic and dyadic. Our extensive study, which includes experiments on both newly compiled and existing datasets featuring two types of agents -- one capable of monadic interaction with the environment, and the other designed for dyadic conversation -- validates the effectiveness and versatility of our approach. To our knowledge, this work takes a significant step forward by combining LLMs and neural rendering for generalized non-verbal prediction and photorealistic rendering of avatar agents.
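The data flow described in the abstract (high-level context, then LLM text descriptions, then motion tokens, then continuous embeddings, then rendered frames) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `AgentContext`, `animate`, and the `stub_*` functions are hypothetical, and the stubs are toy stand-ins, not the paper's LLM, driving engine, or neural renderer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AgentContext:
    """High-level inputs: the environment and the agent profile (hypothetical type)."""
    environment: str
    profile: str


# Hypothetical stage signatures mirroring the four stages named in the abstract.
Planner = Callable[[AgentContext], List[str]]        # LLM: context -> text descriptions
Tokenizer = Callable[[str], List[int]]               # driving engine: text -> motion tokens
Embedder = Callable[[List[int]], List[List[float]]]  # tokens -> continuous embeddings
Renderer = Callable[[List[List[float]]], List[str]]  # embeddings -> frames


def animate(ctx: AgentContext, plan: Planner, to_tokens: Tokenizer,
            to_embeddings: Embedder, render: Renderer) -> List[str]:
    """Chain the stages in order; each stage can be swapped independently."""
    frames: List[str] = []
    for description in plan(ctx):
        tokens = to_tokens(description)
        embeddings = to_embeddings(tokens)
        frames.extend(render(embeddings))
    return frames


# Toy stand-ins so the sketch runs; real stages would be learned models.
def stub_plan(ctx: AgentContext) -> List[str]:
    return [f"raises eyebrows at {ctx.environment}", "smiles slightly"]


def stub_to_tokens(description: str) -> List[int]:
    # One token per word, mapped into a toy codebook of size 8.
    return [len(word) % 8 for word in description.split()]


def stub_to_embeddings(tokens: List[int]) -> List[List[float]]:
    # One 2-D embedding per token.
    return [[t / 8.0, 1.0 - t / 8.0] for t in tokens]


def stub_render(embeddings: List[List[float]]) -> List[str]:
    # One placeholder "frame" string per embedding.
    return [f"frame({e[0]:.2f},{e[1]:.2f})" for e in embeddings]


if __name__ == "__main__":
    ctx = AgentContext(environment="a loud noise", profile="shy")
    for frame in animate(ctx, stub_plan, stub_to_tokens,
                         stub_to_embeddings, stub_render):
        print(frame)
```

Keeping each stage behind its own callable reflects the separation the abstract emphasizes: the renderer is standalone and the driving engine is task-agnostic, so a monadic or dyadic agent would differ only in the planner it supplies.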