We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour, and timbre of a recording to those of a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model to encode speech into discrete units, which makes it simple, effective, and fast to train. All conversion modules are trained only on reconstruction-like tasks, making the method suitable for any-to-many VC with no paired data. We introduce a suite of quantitative and qualitative evaluation metrics for this setup, and empirically demonstrate that DISSC significantly outperforms the evaluated baselines. Code and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/dissc/.