We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio, we output multiple possible gestural motions for an individual, covering the face, body, and hands. The key to our method is combining the sample diversity of vector quantization with the high-frequency detail obtained through diffusion to generate more dynamic, expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g., sneers and smirks). To facilitate this line of research, we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show that our model generates appropriate and diverse gestures, outperforming both diffusion-only and VQ-only methods. Furthermore, our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset are available online.