Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual, multi-speaker speech generation. Despite growing use in Audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size (that is, the size of the unit inventory) and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes, where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, while larger inventories progressively separate them.