Recent advances in deep learning and computer vision have led to a surge of interest in generating realistic talking heads. This paper presents a comprehensive survey of state-of-the-art methods for talking head generation. We systematically categorise them into four main approaches: image-driven, audio-driven, video-driven, and others (including neural radiance field (NeRF) and 3D-based methods). We provide an in-depth analysis of each method, highlighting its unique contributions, strengths, and limitations. Furthermore, we thoroughly compare publicly available models, evaluating them on key aspects such as inference time and human-rated quality of the generated outputs. Our aim is to provide a clear and concise overview of the current landscape in talking head generation, elucidating the relationships between different approaches and identifying promising directions for future research. This survey will serve as a valuable reference for researchers and practitioners interested in this rapidly evolving field.