Languages have long been described according to their perceived rhythmic attributes. The associated typologies are of interest in psycholinguistics as they partly predict newborns' abilities to discriminate between languages and provide insights into how adult listeners process non-native languages. Despite the relative success of rhythm metrics in supporting the existence of linguistic rhythmic classes, quantitative studies have yet to capture the full complexity of temporal regularities associated with speech rhythm. We argue that deep learning offers a powerful pattern-recognition approach to advance the characterization of the acoustic bases of speech rhythm. To explore this hypothesis, we trained a medium-sized recurrent neural network on a language identification task over a large database of speech recordings in 21 languages. The network's input was restricted to amplitude envelopes and a variable identifying voiced segments, on the assumption that this signal conveys little phonetic information but preserves prosodic features. The network identified the language of 10-second recordings in 40% of cases, and the correct language was among its top-3 guesses in two-thirds of cases. Visualization methods show that representations built from the network activations are consistent with speech rhythm typologies, although the resulting maps are more complex than two separate clusters of stress-timed and syllable-timed languages. We further analyzed the model by identifying correlations between network activations and known speech rhythm metrics. The findings illustrate the potential of deep learning tools to advance our understanding of speech rhythm through the identification and exploration of linguistically relevant acoustic feature spaces.
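The top-1 and top-3 identification rates reported above can be illustrated with a minimal sketch. It assumes the classifier outputs one score per language for each recording. The function name `top_k_accuracy` and the toy scores are hypothetical and are not taken from the study.

```python
import numpy as np


def top_k_accuracy(scores, labels, k=1):
    """Fraction of rows whose true label is among the k highest-scoring classes.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (their order does not matter).
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    # A row counts as a hit if its true label appears among those k indices.
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data (made up): 4 recordings scored against 5 language classes.
scores = np.array([
    [0.50, 0.20, 0.12, 0.10, 0.08],  # true class 0: ranked first
    [0.30, 0.40, 0.20, 0.06, 0.04],  # true class 2: ranked third
    [0.05, 0.10, 0.15, 0.30, 0.40],  # true class 0: ranked last
    [0.12, 0.60, 0.15, 0.05, 0.08],  # true class 1: ranked first
])
labels = np.array([0, 2, 0, 1])

top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

With 21 candidate languages, chance level is about 5% for top-1 and about 14% for top-3, which gives a baseline for reading the reported figures.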