The information bottleneck auto-encoder is a tool for disentanglement that is commonly used for voice transformation. Successful disentanglement relies on the right choice of bottleneck size. Previous bottleneck auto-encoders created the bottleneck through the dimension of the latent space or through vector quantization, and offered no means of changing the bottleneck size of a given model. Because the bottleneck removes information from the disentangled representation, the choice of bottleneck size is a trade-off between disentanglement and synthesis quality. We propose to build the information bottleneck using dropout, which allows us to change the bottleneck size through the dropout rate, and we investigate adapting the bottleneck size depending on the context. We experimentally explore the adaptive bottleneck for pitch transformation and demonstrate that it improves disentanglement of the F0 parameter for both speech and singing voice, which in turn improves synthesis quality. Using the variable bottleneck size, we were able to achieve disentanglement for singing voice, including extremely high pitches, and to create a universal voice model that works on both speech and singing voice with improved synthesis quality.
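The idea of a bottleneck controlled by the dropout rate can be illustrated with a minimal sketch. It is not the model described above: the function name `dropout_bottleneck`, the list-of-frames layout of the latent code, and the choice to zero values without rescaling are all assumptions made for illustration. The per-frame rates are supplied by the caller, since the rule that maps context to a rate is not specified here.

```python
import random


def dropout_bottleneck(frames, rates, seed=None):
    """Zero each latent value of frame t independently with probability rates[t].

    Hypothetical helper: a higher rate lets less information through,
    which corresponds to a narrower bottleneck for that frame.
    """
    if len(frames) != len(rates):
        raise ValueError("need one dropout rate per frame")
    rng = random.Random(seed)
    out = []
    for frame, rate in zip(frames, rates):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("dropout rate must lie in [0, 1]")
        # rng.random() lies in [0, 1): rate 0 keeps every value, rate 1 drops every value.
        out.append([0.0 if rng.random() < rate else v for v in frame])
    return out


# Three frames of an 8-dimensional latent code, each with its own rate.
latent = [[1.0] * 8 for _ in range(3)]
narrowed = dropout_bottleneck(latent, [0.0, 0.5, 1.0], seed=0)
```

With the same encoder output, changing only the entries of `rates` widens or narrows the bottleneck frame by frame, which is the property that a fixed latent dimension or a fixed vector-quantization codebook does not offer.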