Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel large-scale synthetic speech dataset specifically designed to enable research on disentangled speech representations. SynSpeech includes controlled variations in speaker identity, spoken text, and speaking style, with three dataset versions to support experimentation at different levels of complexity. In this study, we present a comprehensive framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics to assess the modularity, compactness, and explicitness of the representations learned by a state-of-the-art model. Using the RAVE model as a test case, we find that SynSpeech facilitates benchmarking across a range of factors, achieving promising disentanglement of simpler features such as gender and speaking style, while highlighting the challenges of isolating more complex attributes such as speaker identity. This benchmark dataset and evaluation framework fill a critical gap, supporting the development of more robust and interpretable speech representation learning methods.
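As a rough illustration of the linear-probing step, the sketch below trains a logistic-regression probe to predict a single factor (gender) from fixed latent codes: if a linear classifier recovers the factor well, the representation encodes it explicitly. The array names, dimensions, and placeholder data are assumptions for illustration only, not the paper's actual pipeline or the RAVE API.

```python
# Minimal linear-probing sketch, assuming learned representations and
# factor labels are already extracted as NumPy arrays (names hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
# Placeholder data: 1000 utterances, 128-dim latent codes (e.g., from an
# autoencoder such as RAVE), and a binary factor label such as gender.
latents = rng.normal(size=(1000, 128))
gender = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    latents, gender, test_size=0.2, random_state=0
)

# Linear probe: high held-out accuracy indicates the factor is
# explicitly (linearly) decodable from the representation.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {accuracy_score(y_test, probe.predict(X_test)):.3f}")
```

In practice the probe would be run per factor (speaker identity, spoken text, speaking style) and complemented by the supervised disentanglement metrics mentioned above.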