Disentangling uncorrelated information in speech utterances is a crucial research topic within the speech community. Different speech-related tasks focus on extracting distinct speech representations while minimizing the effects of other, uncorrelated information. We present 3D-Speaker, a large-scale speech corpus to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is simultaneously recorded by multiple Devices located at different Distances, and some speakers speak multiple Dialects. The controlled combinations of multi-dimensional audio data yield a matrix of diverse speech representation entanglements, thereby motivating intriguing methods to untangle them. The multi-domain nature of 3D-Speaker also makes it a suitable resource for evaluating large universal speech models and for experimenting with methods for out-of-domain learning and self-supervised learning. https://3dspeaker.github.io/