Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency $f_0$ and the formants are of primary importance. In this work, we start from a variational autoencoder (VAE) trained in an unsupervised manner on a large dataset of unlabeled natural speech signals, and we show that the source-filter model of speech production naturally arises as orthogonal subspaces of the VAE latent space. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we propose a method to identify the latent subspaces encoding $f_0$ and the first three formant frequencies, and we show that these subspaces are orthogonal. Based on this orthogonality, we develop a method to accurately and independently control the source-filter speech factors within the latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on $f_0$ and the formant frequencies, and which is applied to the transformation of speech signals. Finally, we also propose a robust $f_0$ estimation method that exploits the projection of a speech signal onto the learned latent subspace associated with $f_0$.
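The subspace idea summarized in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: the abstract does not say how the subspaces are estimated, so PCA is used here as an assumed stand-in, and synthetic vectors replace the outputs of a trained VAE encoder. The function names (`fit_subspace`, `subspace_overlap`, `move_in_subspace`), the latent dimension and the subspace sizes are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_subspace(latents, n_components):
    """Return an orthonormal PCA basis (as columns) and the mean of the latents."""
    mean = latents.mean(axis=0)
    _, _, vt = np.linalg.svd(latents - mean, full_matrices=False)
    return vt[:n_components].T, mean


def subspace_overlap(basis_a, basis_b):
    """Largest cosine between directions of two subspaces (0 means orthogonal)."""
    return np.linalg.svd(basis_a.T @ basis_b, compute_uv=False).max()


def move_in_subspace(z, basis, mean, target_coords):
    """Set the coordinates of z inside the subspace; leave the complement as is."""
    return z - basis @ (basis.T @ (z - mean)) + basis @ target_coords


# Synthetic stand-in for encoder outputs (assumption: 16-dimensional latent
# space, with each factor varying inside its own 2-dimensional subspace).
rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))
true_f0, true_f1 = q[:, :2], q[:, 2:4]
offset = rng.standard_normal(16)


def fake_latents(true_basis, n=200, noise=1e-3):
    """Latents for a trajectory in which a single factor varies."""
    coords = rng.standard_normal((n, true_basis.shape[1]))
    return offset + coords @ true_basis.T + noise * rng.standard_normal((n, 16))


basis_f0, mean_f0 = fit_subspace(fake_latents(true_f0), n_components=2)
basis_f1, mean_f1 = fit_subspace(fake_latents(true_f1), n_components=2)

# Orthogonality check between the two learned subspaces.
overlap = subspace_overlap(basis_f0, basis_f1)

# Move one latent vector inside the f0 subspace only.
z = offset + true_f0 @ np.array([0.5, -1.0]) + true_f1 @ np.array([2.0, 0.3])
target = np.array([1.5, 0.0])
z_new = move_in_subspace(z, basis_f0, mean_f0, target)

f0_coords_after = basis_f0.T @ (z_new - mean_f0)
f1_shift = np.abs(basis_f1.T @ (z_new - z)).max()
print(f"overlap={overlap:.4f}, f1 shift={f1_shift:.4f}")
```

Because the two bases are nearly orthogonal, editing the coordinates in one subspace leaves the coordinates in the other almost unchanged. This is the property that makes independent control of the factors possible. In a real setting, a mapping from a factor value (for example $f_0$ in Hz) to subspace coordinates would also have to be learned, and that step is omitted here.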