In this paper, we introduce a novel framework for generating multi-speaker speech without relying on any audible inputs. Our approach leverages silent electromyography (EMG) signals to capture linguistic content, while facial images are used to infer the vocal identity of the target speaker. Notably, we present a pitch-disentangled content embedding that improves the extraction of linguistic content from EMG signals. Extensive analysis demonstrates that our method can generate multi-speaker speech without any audible inputs and confirms the effectiveness of the proposed pitch-disentanglement approach.