Disentangled representation learning from speech remains limited despite its importance in many application domains. A key challenge is the lack of speech datasets with known generative factors to evaluate methods. This paper proposes SynSpeech: a novel synthetic speech dataset with ground truth factors enabling research on disentangling speech representations. We plan to present a comprehensive study evaluating supervised techniques using established supervised disentanglement metrics. This benchmark dataset and framework address the gap in the rigorous evaluation of state-of-the-art disentangled speech representation learning methods. Our findings will provide insights to advance this underexplored area and enable more robust speech representations.