XLSR-53, a multilingual model of speech, builds a vector representation from audio, which allows for a range of computational treatments. The experiments reported here use this neural representation to estimate the degree of closeness between audio files, with the ultimate aim of extracting relevant linguistic properties. We use max-pooling to aggregate the neural representations from a "snippet-lect" (the speech in a 5-second audio snippet) to a "doculect" (the speech in a given resource), and then to dialects and languages. We use data from corpora of 11 dialects belonging to 5 less-studied languages. Similarity measurements between the 11 corpora show the greatest closeness between those that are known to be dialects of the same language. The findings suggest that (i) dialect/language can emerge among the various parameters characterizing audio files and (ii) estimates of overall phonetic/phonological closeness can be obtained for a low-resource or fully unknown language. The findings help shed light on the type of information captured by neural representations of speech and how it can be extracted from these representations.
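The aggregation described above (snippet-lect to doculect to dialect, followed by a similarity measurement between the resulting vectors) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the snippet vectors are random stand-ins rather than real XLSR-53 outputs, the dimension `DIM` and the names `toy_corpus`, `max_pool`, `cosine_similarity`, and `similarity_matrix` are hypothetical, and cosine similarity is assumed as the closeness measure because the abstract does not name one.

```python
import numpy as np


def max_pool(vectors):
    """Element-wise maximum over a stack of vectors (shape: n_vectors x dim)."""
    return np.max(np.asarray(vectors, dtype=float), axis=0)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (assumed closeness measure)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_matrix(vectors_by_name):
    """Pairwise cosine similarities between named vectors."""
    names = list(vectors_by_name)
    sim = np.zeros((len(names), len(names)))
    for i, name_i in enumerate(names):
        for j, name_j in enumerate(names):
            sim[i, j] = cosine_similarity(
                vectors_by_name[name_i], vectors_by_name[name_j]
            )
    return names, sim


# Toy data: random stand-ins for per-snippet vectors (NOT real model outputs).
# Each dialect holds a list of doculects; each doculect is an array of
# snippet vectors with shape (n_snippets, DIM).
rng = np.random.default_rng(0)
DIM = 16  # hypothetical dimension, chosen small for readability
toy_corpus = {
    "dialect_A": [rng.normal(size=(4, DIM)), rng.normal(size=(6, DIM))],
    "dialect_B": [rng.normal(size=(5, DIM))],
}

dialect_vectors = {}
for dialect, doculects in toy_corpus.items():
    # Snippet-lects -> doculect: max-pool over the snippets of each resource.
    doculect_vectors = [max_pool(snippets) for snippets in doculects]
    # Doculects -> dialect: max-pool again over the doculect vectors.
    dialect_vectors[dialect] = max_pool(doculect_vectors)

names, sim = similarity_matrix(dialect_vectors)
print(names)
print(np.round(sim, 3))
```

Because the maximum is associative, pooling snippets into doculects and then doculects into a dialect gives the same vector as pooling all of that dialect's snippets at once; the intermediate level matters only when doculects are also compared with one another.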