This work considers merging two independent models, TTS (text-to-speech) and A2F (audio-to-face), into a unified model that enables internal feature transfer, thereby improving the consistency between the audio and facial expressions generated from text. We also discuss extending the emotion control mechanism from TTS to the joint model. This work does not aim to showcase generation quality; rather, from a system design perspective, it validates the feasibility of reusing intermediate representations from TTS for joint modeling of speech and facial expressions, and provides an engineering reference for subsequent speech-expression co-design. The project code is open-sourced at: https://github.com/GoldenFishes/UniTAF
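To make the "internal feature transfer" idea concrete, below is a minimal PyTorch sketch of the kind of unified architecture described above: a shared backbone produces one intermediate representation, from which both a speech head and a facial-expression head decode, with an emotion embedding injected into the shared features. All class names, dimensions, and heads here are hypothetical illustrations, not taken from the UniTAF codebase.

```python
import torch
import torch.nn as nn

class UnifiedTTSA2F(nn.Module):
    """Hypothetical sketch of a unified TTS + A2F model.

    A single backbone encodes text (plus an emotion embedding) into
    intermediate features; both the audio decoder and the facial
    expression head read the SAME features, so the representation
    that shapes prosody also drives facial motion.
    """

    def __init__(self, vocab_size=256, d_model=512, n_mels=80,
                 n_blendshapes=52, n_emotions=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Emotion control extended from TTS to the joint model:
        # a learned embedding added onto the shared features.
        self.emotion_embed = nn.Embedding(n_emotions, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=6)
        # Two heads decode from the shared intermediate representation.
        self.mel_head = nn.Linear(d_model, n_mels)          # speech branch
        self.face_head = nn.Linear(d_model, n_blendshapes)  # A2F branch

    def forward(self, tokens, emotion_id):
        # tokens: (batch, seq_len) int64; emotion_id: (batch,) int64
        h = self.text_embed(tokens) + self.emotion_embed(emotion_id)[:, None, :]
        h = self.backbone(h)               # shared intermediate features
        mel = self.mel_head(h)             # mel-spectrogram frames
        blendshapes = self.face_head(h)    # facial expression coefficients
        return mel, blendshapes

# Usage sketch:
model = UnifiedTTSA2F()
tokens = torch.randint(0, 256, (2, 64))
emotion_id = torch.tensor([0, 3])
mel, blendshapes = model(tokens, emotion_id)
```

The design point this illustrates: because both outputs are decoded from one representation rather than chaining two separately trained models (TTS output fed into A2F input), timing and emotional cues are shared by construction, which is the source of the audio-face consistency the work targets.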