Annotating a multilingual code-switched corpus is painstaking and requires specialist linguistic expertise. One reason is the large number of language combinations that can occur within and across utterances, so that several annotators with different language expertise may have to work through the same utterance one after another, which is time-consuming and costly. If the languages spoken in an utterance, and the boundaries between them, were known before annotation began, segments could be assigned to the relevant language experts in parallel. To this end, we investigate the development of a continuous multilingual language diarizer based on fine-tuned speech representations extracted from a large pre-trained self-supervised architecture (WavLM). We experiment with a code-switched corpus covering five South African languages (isiZulu, isiXhosa, Setswana, Sesotho and English) and show substantial improvements in diarization error rate over baseline systems for language families, language groups, and individual languages.
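To make the intended use of a language diarizer concrete, the following minimal sketch shows how frame-level language predictions could be collapsed into timed segments for parallel assignment to annotators, and how a simplified frame-level error rate could be computed. It is an illustration under stated assumptions, not the system evaluated here: the function names, the language codes, and the 20 ms frame shift are hypothetical choices, and the error rate counts only language confusion per frame, whereas a full diarization error rate would normally also account for missed and falsely detected speech.

```python
from itertools import groupby
from typing import List, Tuple

# Assumption: one prediction per 20 ms frame (typical of WavLM-style encoders).
FRAME_SHIFT_S = 0.02


def frames_to_segments(
    labels: List[str], frame_shift: float = FRAME_SHIFT_S
) -> List[Tuple[float, float, str]]:
    """Merge runs of identical frame labels into (start_s, end_s, language) segments."""
    segments = []
    start = 0  # index of the first frame of the current run
    for lang, run in groupby(labels):
        n = len(list(run))
        segments.append((start * frame_shift, (start + n) * frame_shift, lang))
        start += n
    return segments


def frame_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """Fraction of frames whose predicted language differs from the reference.

    Simplified stand-in for a diarization error rate: only language
    confusion is counted, with no collar and no speech/non-speech errors.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    if not reference:
        return 0.0
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)
```

Each segment returned by `frames_to_segments` carries a language label and its boundaries, so segments could be routed to the matching language expert independently of the rest of the utterance.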