In dyadic speaker-listener interactions, the listener's head reactions and the speaker's head movements together constitute an important form of non-verbal semantic expression. The listening head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. Compared to talking-head generation, capturing the correlation cues from the speaker's audio and visual information is more challenging. Following the ViCo baseline scheme, we propose a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution ranks first on the official leaderboard for the listening head generation track. This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.