We present the latest iteration of the voice conversion challenge (VCC) series, a biennial scientific event that aims to compare and understand different voice conversion (VC) systems based on a common dataset. This year we shifted our focus to singing voice conversion (SVC) and accordingly named the challenge the Singing Voice Conversion Challenge (SVCC). A new database was constructed for two tasks, namely in-domain and cross-domain SVC. The challenge ran for two months, and we received 26 submissions in total, including 2 baselines. Through a large-scale crowd-sourced listening test, we observed that for both tasks, although the top system achieved human-level naturalness, no team obtained a similarity score as high as that of the target speakers. Also, as expected, cross-domain SVC is harder than in-domain SVC, especially in terms of similarity. We also investigated whether existing objective measurements could predict perceptual performance, and found that only a few of them reached a significant correlation.